* GNAT compiler switches and optimization
@ 2006-10-20 10:47 tkrauss  (13 replies; 68+ messages in thread)
From: tkrauss @ 2006-10-20 10:47 UTC (permalink / raw)

I'm a bit stuck trying to figure out how to coax more performance out
of some Ada code.  I suspect there is something simple (like compiler
switches) that I'm missing.  As an example I'm using a simple matrix
multiply and comparing it to similar code in Fortran.  Unfortunately
the Ada code takes 3-4 times as long.

I'm using GNAT (GPL 2006) and gfortran (4.2.0) and the following
compile options:

  gnat make -O3 -gnatp tst_array
  gfortran -O3 tst_array.f95

Running them on 800x800 matrices (on my 2 GHz laptop):

  Ada:     "tst_array 800" runs in 18 seconds
  Fortran: "tst_array 800" runs in 6 seconds

(If I use the Fortran "matmul" intrinsic, the Fortran time drops to
2.5 seconds.)

Note, I tried reordering the loops, removing the random calls, etc.,
none of which made huge changes.  Something is killing performance,
and/or there is a switch or two that I'm missing, but I can't seem to
find it.  Any thoughts?
Here is the code:

-- tst_array.adb
with Ada.Numerics.Float_Random; use Ada.Numerics.Float_Random;
with Ada.Command_Line;          use Ada.Command_Line;
with Ada.Text_IO;               use Ada.Text_IO;
with Ada.Calendar;              use Ada.Calendar;

procedure tst_array is
   package F_IO is new Ada.Text_IO.Float_IO(Float);
   package D_IO is new Ada.Text_IO.Fixed_IO(Duration);
   type Real_Matrix is array(Integer range <>, Integer range <>) of Float;
   type Matrix_Access is access Real_Matrix;

   N : Positive;
   G : Ada.Numerics.Float_Random.Generator;

   A,B,C : Matrix_Access;
   Start, Finish : Ada.Calendar.Time;
   Sum : Float := 0.0;
begin
   N := Positive'Value (Argument (1));
   Start := Ada.Calendar.Clock;

   A := new Real_Matrix(1..N, 1..N);
   B := new Real_Matrix(1..N, 1..N);
   C := new Real_Matrix(1..N, 1..N);
   for I in A'range(1) loop
      for J in A'range(2) loop
         A(I,J) := Ada.Numerics.Float_Random.Random(G);
         B(I,J) := Ada.Numerics.Float_Random.Random(G);
      end loop;
   end loop;

   for I in A'range(1) loop
      for J in A'range(2) loop
         Sum := 0.0;
         for R in A'range(2) loop
            Sum := Sum + A(I,R)*B(R,J);
         end loop;
         C(I,J) := Sum;
      end loop;
   end loop;

   Finish := Ada.Calendar.Clock;

   F_IO.Put (C(1,1)); F_IO.Put (C(1,2)); New_Line;
   F_IO.Put (C(2,1)); F_IO.Put (C(2,2)); New_Line;

   Finish := Ada.Calendar.Clock;
   Put ("Time: "); D_IO.Put(Finish-Start); New_Line; New_Line;
end tst_array;


-- tst_array.f95
program test_array
  integer,parameter :: seed = 86456
  integer :: N
  integer :: numArgs
  real, dimension(:,:), allocatable :: a, b, c
  real, dimension(:,:), allocatable :: d, e, f
  real :: sum
  character*100 buffer
  real :: begin, finish
  integer :: I, J, R

  call getarg(1,buffer)
  read(buffer,*) N

  call srand(seed)

  begin = second()

  allocate( a(N,N) )
  allocate( b(N,N) )
  allocate( c(N,N) )
  forall (I = 1:N, J = 1:N)
    a(i,j) = rand()
    b(i,j) = rand()
  end forall

  !c = matmul(a, b)
  do I = 1,N
    do J = 1,N
      sum = 0.0
      do R = 1,N
        sum = sum + A(I,R)*B(R,J)
      end do
      C(I,J) = sum
    end do
  end do

  print *, c(1,1), c(1,2), c(2,1), c(2,2)
  finish = second()
  print *, 'Time: ', (finish-begin)

end program test_array

^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization
From: Duncan Sands @ 2006-10-20 11:04 UTC (permalink / raw)
To: comp.lang.ada; +Cc: tkrauss

On Friday 20 October 2006 12:47, tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.

GNAT GPL 2006 is based on gcc 3.4.6.  For Fortran you are using
gcc 4.2.0.  Try using comparable compiler versions, e.g. an Ada-aware
gcc 4.2.0 (several Linux distributions provide this) or a gcc 3.4.6
version of Fortran (i.e. some version of g77).

Ciao,

Duncan.
* Re: GNAT compiler switches and optimization
From: Stephen Leake @ 2006-10-21 10:45 UTC (permalink / raw)

Duncan Sands <duncan.sands@math.u-psud.fr> writes:
> GNAT GPL 2006 is based on gcc 3.4.6.  For fortran you are using
> gcc 4.2.0.  Try using comparable compiler versions, eg: an Ada
> aware gcc 4.2.0 (several linux distributions provide this) or
> a gcc 3.4.6 version of fortran (i.e. some version of g77).

I have a pre-release of GNAT 5.05, which is based on gcc 4.2.  I tried
tst_array with both -O2 and -O3; -O2 was faster.  AdaCore says -O3 has
"dangerous experimental optimizations; don't use it".

Here's a comparison:

GNAT 5.04, -O2 -gnatp:

  ./tst_array.exe 800
  Init Time:  0.557314686
  Mult Time: 20.139092538
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02

GNAT 5.05, -O2 -gnatp:

  ./tst_array.exe 800
  Init Time:  0.575690486
  Mult Time: 12.160329316
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02

Note that I timed the initialization with random numbers separately
from the matrix multiply; it could easily be that the random number
generators are different.  But that's a small effect.

I don't have gfortran installed.

--
-- Stephe
* Re: GNAT compiler switches and optimization
From: Duncan Sands @ 2006-10-20 11:42 UTC (permalink / raw)
To: comp.lang.ada; +Cc: tkrauss

On Friday 20 October 2006 12:47, tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.

GNAT GPL 2006 is based on gcc 3.4.6.  For Fortran you are using
gcc 4.2.0.  Try using comparable compiler versions, e.g. an Ada-aware
gcc 4.2.0 (several Linux distributions provide this) or a gcc 3.4.6
version of Fortran (i.e. some version of g77).

Ciao,

Duncan.
* Re: GNAT compiler switches and optimization
From: Martin Krischik @ 2006-10-20 15:41 UTC (permalink / raw)

Duncan Sands wrote:
> GNAT GPL 2006 is based on gcc 3.4.6.  For fortran you are using
> gcc 4.2.0.  Try using comparable compiler versions, eg: an Ada
> aware gcc 4.2.0 (several linux distributions provide this) or
> a gcc 3.4.6 version of fortran (i.e. some version of g77).

Indeed.

GNAT/GPL:

  ./Linux-x86_64-Release/Test_Array_1 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:   12.265258000

GNAT/GCC:

  ./Linux-x86_64-Release/Test_Array_1 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:   10.631787000

So who said the 4.1.x compilers are slower ...

Martin
--
mailto://krischik@users.sourceforge.net
Ada programming at: http://ada.krischik.com
* Re: GNAT compiler switches and optimization
From: Samuel Tardieu @ 2006-10-20 12:09 UTC (permalink / raw)

>>>>> "tkrauss" == tkrauss <thomas.krauss@gmail.com> writes:

tkrauss> Running them on 800x800 matrices (on my 2GHz laptop)
tkrauss> for Ada: "tst_array 800" runs in 18 seconds
tkrauss> for Fortran "tst_array 800" runs in 6 seconds

tkrauss> Note, I tried reordering the loops, removing the random
tkrauss> calls, etc. none of which made huge changes.  There is
tkrauss> something killing performance and/or a switch or two that
tkrauss> I'm missing, but I can't seem to find it.  Any thoughts?

First of all, what you measure is not only the matrix multiplication
time but also the operation of filling the matrices with random
numbers.  I've moved the "Start" initialization after the matrix
initialization.

The following optimizations make the difference smaller (9.47 seconds
for Fortran vs. 11.90 seconds for Ada on my machine):

 - use -fomit-frame-pointer on the gnatmake command line (this doesn't
   change anything in the Fortran case)

 - add "pragma Convention (Fortran, Real_Matrix);" to invert the
   storage method (row- vs. column-major); I guess this helps keep
   more data in the cache

 - use 1 .. N as loop indices instead of A'Range (1) and friends; this
   is closer to the Fortran code you posted

Still, this is a huge penalty for Ada.  Unfortunately, I don't have
the time to investigate further right now.  However, I would be
interested in other people's findings.

  Sam
--
Samuel Tardieu -- sam@rfc1149.net -- http://www.rfc1149.net/
* Re: GNAT compiler switches and optimization
From: Samuel Tardieu @ 2006-10-20 12:18 UTC (permalink / raw)

>>>>> "Sam" == Samuel Tardieu <sam@rfc1149.net> writes:

Sam> Still, this is a huge penalty for Ada.  Unfortunately, I don't
Sam> have the time to investigate further right now.  However, I would
Sam> be interested in other people's findings.

Oh, also this one lowers the Ada execution time by around 10%: do not
use an unconstrained type and do not use pointers.

  N : constant Positive := Positive'Value (Argument (1));
  G : Ada.Numerics.Float_Random.Generator;

  type Real_Matrix is array (1 .. N, 1 .. N) of Float;
  pragma Convention (Fortran, Real_Matrix);

  A, B, C : Real_Matrix;

And as Duncan says, you should try a newer backend for Ada.

  Sam
--
Samuel Tardieu -- sam@rfc1149.net -- http://www.rfc1149.net/
* Re: GNAT compiler switches and optimization
From: Gautier @ 2006-10-20 12:12 UTC (permalink / raw)

tkrauss:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.

One thing is that you start the timer in the wrong place.  You should
start it after filling your arrays with random numbers.  In the
present state you compare the random generators, then your matrix
multiplication.  It is possible that Ada.Numerics.Float_Random.Random
takes significantly more time due to Ada's quality requirements on
random generation.

On the other hand, switches you can try are:

 - -O2 instead of -O3
 - -funroll-loops (usually good)
 - -ffast-math, for both Ada and Fortran
 - -gnatn, for Ada

Since you are measuring real time and not CPU time, you might also
want to use somewhat larger matrices so that disk swaps and leftover
initialization effects are statistically small.

HTH, Gautier
______________________________________________________________
Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm
NB: For a direct answer, e-mail address on the Web site!
* Re: GNAT compiler switches and optimization
From: Dmitry A. Kazakov @ 2006-10-20 12:35 UTC (permalink / raw)

On 20 Oct 2006 03:47:44 -0700, tkrauss wrote:

> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.
[...]

Pointers + index checks, I'd say.
Try this:

with Ada.Numerics.Float_Random; use Ada.Numerics.Float_Random;
with Ada.Command_Line;          use Ada.Command_Line;
with Ada.Text_IO;               use Ada.Text_IO;
with Ada.Calendar;              use Ada.Calendar;

procedure tst_array_2 is
   package F_IO is new Ada.Text_IO.Float_IO(Float);
   package D_IO is new Ada.Text_IO.Fixed_IO(Duration);
   type Real_Matrix is array(Integer range <>, Integer range <>) of Float;

   N : Positive;
   G : Ada.Numerics.Float_Random.Generator;
   Start, Finish : Ada.Calendar.Time;
   Sum : Float := 0.0;
begin
   N := Positive'Value (Argument (1));
   declare
      subtype Matrix is Real_Matrix (1..N, 1..N);
      A, B, C : Matrix;
   begin
      for I in A'range(1) loop
         for J in A'range(2) loop
            A(I,J) := Ada.Numerics.Float_Random.Random(G);
            B(I,J) := Ada.Numerics.Float_Random.Random(G);
         end loop;
      end loop;

      Start := Ada.Calendar.Clock;
      for I in A'range(1) loop
         for J in A'range(2) loop
            Sum := 0.0;
            for R in A'range(2) loop
               Sum := Sum + A(I,R)*B(R,J);
            end loop;
            C(I,J) := Sum;
         end loop;
      end loop;
      Finish := Ada.Calendar.Clock;

      F_IO.Put (C(1,1)); F_IO.Put (C(1,2)); New_Line;
      F_IO.Put (C(2,1)); F_IO.Put (C(2,2)); New_Line;
   end;
   Put ("Time: "); D_IO.Put(Finish-Start); New_Line; New_Line;
end tst_array_2;

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
* Re: GNAT compiler switches and optimization
From: Martin Krischik @ 2006-10-20 15:53 UTC (permalink / raw)

Dmitry A. Kazakov wrote:
>    declare
>       subtype Matrix is Real_Matrix (1..N, 1..N);
>       A, B, C : Matrix;
>    begin

Indeed faster:

  ./Linux-x86_64-Release/Test_Array_1 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:   10.459045000

  ./Linux-x86_64-Release/Test_Array_2 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:    6.313824000

Martin
--
mailto://krischik@users.sourceforge.net
Ada programming at: http://ada.krischik.com
* Re: GNAT compiler switches and optimization
From: Gautier @ 2006-10-20 12:52 UTC (permalink / raw)

tkrauss wrote:

>   N := Positive'Value (Argument (1));
...
>   for I in A'range(1) loop
>     for J in A'range(2) loop

You probably pay a penalty there compared to

>   do I = 1,N
>     do J = 1,N

I guess the GNAT compiler can't see through the bounds because of

>   A := new Real_Matrix(1..N, 1..N);

especially since the allocation is dynamic.  You indeed have

  for I in 1..N loop

so you should try writing that explicitly, and even

  for I in reverse 1..N loop

Perhaps g95 does that optimization where Ada has to stick with the
specified direction.  Also, compiling with -S and looking at the
assembler code helps a lot ...

HTH, Gautier
______________________________________________________________
Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm
NB: For a direct answer, e-mail address on the Web site!
* Re: GNAT compiler switches and optimization
From: claude.simon @ 2006-10-20 13:27 UTC (permalink / raw)

With some changes you can speed up your code.

First, the matrix product is not optimal: you can transpose B.
Second, matrix A is used row by row, so you can copy A (I, 1 .. N)
into a vector, say AI, and replace A(I,R) by AI(R) in the product
(it should be up to the compiler to see that :-( ).

By applying these changes I got an Ada version faster than the
Fortran one (at the same -O3 optimization level).  You can make the
same changes to your Fortran code and then search for a new trick to
speed up your Ada code :-).

Claude SIMON

tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.
[...]
> [original programs quoted in full; snipped]
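Claude's two transformations can be sketched as a standalone Ada
program.  This is a hypothetical sketch, not code posted in the
thread; the names BT and AI, the fixed size N, and the filler values
are illustrative assumptions.  After the transpose and the row copy,
both operands of the inner dot product are traversed with stride 1.

```ada
--  Hypothetical sketch of Claude's suggestion: transpose B once, and
--  copy each row of A into a local vector AI before the inner loops,
--  so both operands of the dot product are walked with stride 1.
procedure Tst_Transposed is
   N : constant := 4;   --  small fixed size, for the sketch only
   type Real_Matrix is array (1 .. N, 1 .. N) of Float;
   type Real_Vector is array (1 .. N) of Float;
   A   : constant Real_Matrix := (others => (others => 1.0));
   B   : constant Real_Matrix := (others => (others => 2.0));
   BT  : Real_Matrix;
   AI  : Real_Vector;
   C   : Real_Matrix;
   Sum : Float;
begin
   for I in 1 .. N loop          --  BT := transpose of B
      for J in 1 .. N loop
         BT (I, J) := B (J, I);
      end loop;
   end loop;
   for I in 1 .. N loop
      for R in 1 .. N loop
         AI (R) := A (I, R);     --  row of A, copied once per I
      end loop;
      for J in 1 .. N loop
         Sum := 0.0;
         for R in 1 .. N loop
            Sum := Sum + AI (R) * BT (J, R);   --  both stride-1
         end loop;
         C (I, J) := Sum;
      end loop;
   end loop;
end Tst_Transposed;
```

The transpose costs O(N**2) and the multiply O(N**3), so for large N
the extra pass is cheap compared to the cache misses it avoids.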
* Re: GNAT compiler switches and optimization
From: Robert A Duff @ 2006-10-20 15:38 UTC (permalink / raw)

"tkrauss" <thomas.krauss@gmail.com> writes:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code. ...

In some cases, -O3 is slower than -O2.  You could try that experiment.

Is the Fortran compiler generating array-bounds checks?  If not,
pragma Suppress (All_Checks) in the Ada version will make the test
more fair.

- Bob
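Bob's suggestion can be sketched as follows.  This minimal program is
mine, not from the thread; it just shows where the pragma goes.  (As
Gautier notes downthread, the original poster's -gnatp switch already
has the same effect.)

```ada
--  Minimal sketch (illustrative, not from the thread): a source-level
--  equivalent of compiling with -gnatp.
procedure Tst_Nochecks is
   pragma Suppress (All_Checks);
   type Vec is array (1 .. 100) of Float;
   V : Vec := (others => 1.0);
   S : Float := 0.0;
begin
   for I in V'Range loop
      S := S + V (I);   --  compiled without index or overflow checks
   end loop;
end Tst_Nochecks;
```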
* Re: GNAT compiler switches and optimization
From: Gautier @ 2006-10-20 19:32 UTC (permalink / raw)

Robert A Duff:
> Is the Fortran compiler generating array-bounds checks?  If not,
> pragma Suppress(All_Checks) in the Ada version will make the test
> more fair.

Thomas mentioned the -gnatp option, which has the same effect as
"pragma Suppress (All_Checks)" in the source.

G.
* Re: GNAT compiler switches and optimization
From: Jeffrey Creem @ 2006-10-20 15:56 UTC (permalink / raw)

tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.

There have been a few useful comments (and quite a few not really
useful ones) but in the end, it seems pretty clear to me that in this
particular case GNAT sucks compared to the Fortran version.

I built the gcc "head" from gcc SVN with GNAT and Fortran to compare
the same versions (at least as much as possible).  I moved the start
timing calls after the array allocation and filling so we are just
timing the matrix multiplication, and I moved the stop timing calls to
make sure we were not timing I/O in either case (both original
versions were timing part of the "put").  I replaced the "random" data
with some fixed sane data just to be sure there was no funky
"denormal" stuff happening that changed the speed.

Very little change from the order of magnitude the original poster was
seeing (I pretty much get results with GNAT running about 2.6 times
slower), so it was time to look at the assembly.  I find it easier to
read assembly using SSE math, so I built the Ada version via

  gnatmake -g -f -gnatp -O3 -march=pentium4 -fomit-frame-pointer
    -mfpmath=sse tst_array

and the Fortran version via

  gfortran -O3 -g -march=pentium4 -fomit-frame-pointer -mfpmath=sse
    -c tst_array.f95

and then, using "objdump -D -S tst_array.o" to look at them, you
pretty quickly can see the problem.
The "inner loop" of the Fortran code looks like:

  2d0:  8d 04 19              lea    (%ecx,%ebx,1),%eax
  2d3:  f3 0f 10 02           movss  (%edx),%xmm0
  2d7:  f3 0f 59 44 85 04     mulss  0x4(%ebp,%eax,4),%xmm0
  2dd:  f3 0f 58 c8           addss  %xmm0,%xmm1
  2e1:  83 c1 01              add    $0x1,%ecx
  2e4:  01 f2                 add    %esi,%edx
  2e6:  39 f9                 cmp    %edi,%ecx
  2e8:  75 e6                 jne    2d0 <MAIN__+0x2d0>

The "inner loop" of the Ada code looks like:

  af2:  83 c6 01              add    $0x1,%esi
  af5:  89 f0                 mov    %esi,%eax
  af7:  2b 44 24 28           sub    0x28(%esp),%eax
  afb:  03 44 24 30           add    0x30(%esp),%eax
  aff:  8b 5c 24 38           mov    0x38(%esp),%ebx
  b03:  f3 0f 10 0c 83        movss  (%ebx,%eax,4),%xmm1
  b08:  8b 4d 00              mov    0x0(%ebp),%ecx
  b0b:  8b 45 0c              mov    0xc(%ebp),%eax
  b0e:  8b 55 08              mov    0x8(%ebp),%edx
  b11:  8b 5c 24 78           mov    0x78(%esp),%ebx
  b15:  29 d3                 sub    %edx,%ebx
  b17:  89 f7                 mov    %esi,%edi
  b19:  29 cf                 sub    %ecx,%edi
  b1b:  89 f9                 mov    %edi,%ecx
  b1d:  83 c0 01              add    $0x1,%eax
  b20:  29 d0                 sub    %edx,%eax
  b22:  01 c0                 add    %eax,%eax
  b24:  01 c0                 add    %eax,%eax
  b26:  ba 00 00 00 00        mov    $0x0,%edx
  b2b:  0f 48 c2              cmovs  %edx,%eax
  b2e:  0f af c8              imul   %eax,%ecx
  b31:  8d 1c 99              lea    (%ecx,%ebx,4),%ebx
  b34:  8b 44 24 3c           mov    0x3c(%esp),%eax
  b38:  f3 0f 10 04 03        movss  (%ebx,%eax,1),%xmm0
  b3d:  f3 0f 59 c1           mulss  %xmm1,%xmm0
  b41:  f3 0f 58 d0           addss  %xmm0,%xmm2
  b45:  3b 74 24 7c           cmp    0x7c(%esp),%esi
  b49:  75 a7                 jne    af2 <_ada_tst_array+0x254>

28 instructions vs. 8 for Fortran.  The GNAT version never stood a
chance.  It really seems like GNAT is dropping the ball here.
Granted, small benchmarks can really lead one to believe things are
better or worse than the truth, but I don't think there is really an
excuse in this case for this sort of performance.
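The extra Ada instructions are address arithmetic redone on every
iteration.  One hypothetical way to sidestep it (my sketch, not
something posted in the thread) is to flatten the matrices to
one-dimensional arrays and compute the row offsets by hand, leaving
the compiler a single stride-1 index; the names and fixed size below
are illustrative assumptions.

```ada
--  Hypothetical workaround for the repeated descriptor arithmetic:
--  flatten the matrices and do the row-major index math explicitly.
procedure Flat_Mult is
   N : constant := 4;
   type Flat is array (0 .. N * N - 1) of Float;
   A : constant Flat := (others => 1.0);
   B : constant Flat := (others => 1.0);
   C : Flat := (others => 0.0);
   Sum : Float;
begin
   for I in 0 .. N - 1 loop
      for J in 0 .. N - 1 loop
         Sum := 0.0;
         for R in 0 .. N - 1 loop
            --  row-major: A(I,R) = A(I*N + R), B(R,J) = B(R*N + J)
            Sum := Sum + A (I * N + R) * B (R * N + J);
         end loop;
         C (I * N + J) := Sum;
      end loop;
   end loop;
end Flat_Mult;
```

Whether this actually beats the two-dimensional form depends on the
compiler version; checking the -S output, as suggested upthread, is
the way to confirm.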
* Re: GNAT compiler switches and optimization
From: Martin Krischik @ 2006-10-20 16:30 UTC (permalink / raw)

tkrauss wrote:
> Unfortunately the Ada code takes 3-4 times as long.

Well, I collected the various suggestions made and created three test
programs [1,2,3], each indeed faster than the last.

GNAT 4.1.1:

  ./Linux-x86_64-Release/Test_Array_1 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:   18.615262000

  ./Linux-x86_64-Release/Test_Array_2 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:    5.457141000

  ./Linux-x86_64-Release/Test_Array_3 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:    4.509063000

GNAT GPL 2006 (20060522-34):

  ./Linux-x86_64-Release/Test_Array_1 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:   13.950607000

  ./Linux-x86_64-Release/Test_Array_2 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:    8.010260000

  ./Linux-x86_64-Release/Test_Array_3 800
   1.99202E+02 1.96854E+02
   1.88844E+02 1.84498E+02
  Time:    7.101502000

As options I used my trusted "Release" set, which you can see here
[4].

Martin

[1] http://svn.sourceforge.net/viewvc/wikibook-ada/trunk/demos/Source/test_array_1.adb?view=markup
[2] http://svn.sourceforge.net/viewvc/wikibook-ada/trunk/demos/Source/test_array_2.adb?view=markup
[3] http://svn.sourceforge.net/viewvc/wikibook-ada/trunk/demos/Source/test_array_3.adb?view=markup
[4] http://svn.sourceforge.net/viewvc/wikibook-ada/trunk/demos/GNAT/wikibook_ada.gpr?view=markup

--
mailto://krischik@users.sourceforge.net
Ada programming at: http://ada.krischik.com
* Re: GNAT compiler switches and optimization
From: Gautier @ 2006-10-20 19:51 UTC (permalink / raw)

Just a detail (it should not make much difference, but who knows with
Ada.Text_IO): the timer should be stopped just after the
multiplication and before any output.  This little bug is in both the
Fortran and the Ada code (in the Ada code the finish time is even
taken twice).

G.
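The timing discipline Gautier describes can be sketched in a small
self-contained program (mine, not from the thread; the dummy workload
is an arbitrary stand-in for the multiply): read the clock right
before and right after the work, and do all output afterwards.

```ada
--  Sketch of the timing fix: the clock is stopped once, immediately
--  after the measured work and before any Text_IO.
with Ada.Calendar; use Ada.Calendar;
with Ada.Text_IO;  use Ada.Text_IO;

procedure Tst_Timing is
   Start, Finish : Time;
   S : Long_Float := 0.0;
begin
   Start := Clock;
   for I in 1 .. 10_000_000 loop      --  stand-in for the real work
      S := S + Long_Float (I);
   end loop;
   Finish := Clock;                   --  before the Put calls below
   Put_Line ("Sum:  " & Long_Float'Image (S));
   Put_Line ("Time: " & Duration'Image (Finish - Start));
end Tst_Timing;
```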
* Re: GNAT compiler switches and optimization
From: Jeffrey R. Carter @ 2006-10-20 22:11 UTC (permalink / raw)

tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.

1st, make sure you're timing the same thing.  How does Ada matrix
multiplication compare to FORTRAN?  You don't know, because you're
also timing the random number generators and I/O.  Ada.Text_IO,
especially, is quite heavy-weight compared to other languages.

Here's my version of your program:

with Ada.Numerics.Float_Random;
with Ada.Command_Line; use Ada.Command_Line;
with Ada.Text_IO;      use Ada.Text_IO;
with Ada.Calendar;     use Ada.Calendar;

procedure Tst_Array is
   package F_IO is new Ada.Text_IO.Float_IO (Float);
   package D_IO is new Ada.Text_IO.Fixed_IO (Duration);

   N : constant Positive := Integer'Value (Argument (1) );

   type Real_Matrix is array (1 .. N, 1 .. N) of Float;
   pragma Convention (FORTRAN, Real_Matrix);

   G : Ada.Numerics.Float_Random.Generator;

   A, B : Real_Matrix :=
      (others => (others => Ada.Numerics.Float_Random.Random (G) ) );
   C : Real_Matrix := (others => (others => 0.0) );

   Start, Finish : Ada.Calendar.Time;
begin
   Start := Ada.Calendar.Clock;

   for I in A'range (1) loop
      for J in A'range (2) loop
         for R in A'range (2) loop
            C (I, J) := C (I, J) + A (I, R) * B (R, J);
         end loop;
      end loop;
   end loop;

   Finish := Ada.Calendar.Clock;

   F_IO.Put (C (1, 1) ); F_IO.Put (C (1, 2) ); New_Line;
   F_IO.Put (C (2, 1) ); F_IO.Put (C (2, 2) ); New_Line;
   Put ("Time: "); D_IO.Put (Finish - Start); New_Line;
end Tst_Array;

I compiled it as

  gnatmake -O3 -gnatnp -fomit-frame-pointer tst_array

With MinGW GNAT 3.4.2, Windows XP, and a 3.2 GHz Pentium 4 HT
processor, I get:

  C:\Documents and Settings\All Users\Documents\Code>tst_array 800
   2.05839E+02 2.01230E+02
   2.00866E+02 1.94039E+02
  Time:  5.821082831

I don't have a FORTRAN compiler for comparison.  FORTRAN will probably
do better, but I wouldn't expect the difference to be as great as your
values.

--
Jeff Carter
"I was hobbling along, minding my own business, all of a sudden, up he
comes, cures me!  One minute I'm a leper with a trade, next minute my
livelihood's gone!  Not so much as a 'by your leave!'  You're cured,
mate.  Bloody do-gooder!"
Monty Python's Life of Brian
* Re: GNAT compiler switches and optimization 2006-10-20 22:11 ` Jeffrey R. Carter @ 2006-10-20 23:52 ` Jeffrey Creem 2006-10-21 7:37 ` Gautier 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-20 23:52 UTC (permalink / raw) Jeffrey R. Carter wrote: > tkrauss wrote: > >> I'm a bit stuck trying to figure out how to coax more performance >> out of some Ada code. I suspect there is something simple (like >> compiler switches) but I'm missing it. As an example I'm using >> a simple matrix multiply and comparing it to similar code in >> Fortran. Unfortunately the Ada code takes 3-4 times as long. > > > 1st, make sure you're timing the same thing. How does Ada matrix > multiplication compare to FORTRAN? You don't know, because you're also > timing the random number generators and I/O. Ada.Text_IO, especially, is > quite heavy-weight compared to other languages. > > Here's my version of your program: > Note, I am the first one to jump to the defense of "Ada" in general, but in this case GNAT just plain sucks compared to GNU FORTRAN, as it does a poor job on (at least) the inner loop (verified by looking at the output assembly). Jeff's (the other Jeff :) modified version looks a little cleaner and actually runs faster (than even the old "fixed version" that did not time the IO and made sure to just time the matrix multiply in both versions), but it does not time the zeroing of the elements of C, which would be required if this were a real matrix multiply routine and not some test driver. However, even having said that, this not-really-equivalent version runs about 2x slower than the FORTRAN (with the same version of GCC). I don't see any meaningful excuse for why GNAT should be slower in this case, but it clearly is. I tried looking at the tree dump generated by the front ends prior to going to the optimizer step (not something I have a lot of experience at).
One thing that is clear is that the trees generated by GNAT are quite a bit uglier and more verbose, so it is not surprising that the optimizer is unable to fully clean things up, resulting in the explosion of instructions at the assembly level. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-20 23:52 ` Jeffrey Creem @ 2006-10-21 7:37 ` Gautier 2006-10-21 16:35 ` Jeffrey Creem 2006-10-23 6:28 ` Martin Krischik 0 siblings, 2 replies; 68+ messages in thread From: Gautier @ 2006-10-21 7:37 UTC (permalink / raw) Jeffrey Creem: > Note, I am the first one to jump to the defense of "Ada" in general but > in this case, GNAT just plain sucks compared to GNU FORTRAN as it does a > poor job on (at least) the inner loop (verified by looking at the output > assembly) There is something strange... Martin Krischik was able to trim the overall time for the Ada code down to 24% of the first version (GNAT/GCC 4.1.1). This should make the Ada program as fast as the FORTRAN one, shouldn't it ? Maybe it's because the test is done on a 64 bit machine ? It needs some reconciliation... A good thing in that discussion would be that everybody shows each time - which GCC version - which machine - the execution time of the multiplication for both Ada and Fortran - which version of the Ada code (matrix on stack/heap, Fortran or Ada convention) Cheers, Gautier ______________________________________________________________ Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm NB: For a direct answer, e-mail address on the Web site! ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-21 7:37 ` Gautier @ 2006-10-21 16:35 ` Jeffrey Creem 2006-10-21 17:04 ` Pascal Obry 2006-10-21 18:28 ` tmoran 2006-10-23 6:28 ` Martin Krischik 1 sibling, 2 replies; 68+ messages in thread From: Jeffrey Creem @ 2006-10-21 16:35 UTC (permalink / raw) Gautier wrote: > Jeffrey Creem: > >> Note, I am the first one to jump to the defense of "Ada" in general >> but in this case, GNAT just plain sucks compared to GNU FORTRAN as it >> does a poor job on (at least) the inner loop (verified by looking at >> the output assembly) > > > There is something strange... Martin Krischik was able to trim the > overall time for the Ada code down to 24% of the first version (GNAT/GCC > 4.1.1). > This should make the Ada program as fast as the FORTRAN one, shouldn't it ? > Maybe it's because the test is done on a 64 bit machine ? > It needs some reconciliation... > A good thing in that discussion would be that everybody shows each time > - which GCC version > - which machine > - the execution time of the multiplication for both Ada and Fortran > - which version of the Ada code (matrix on stack/heap, Fortran or Ada > convention) > > Cheers, Gautier > ______________________________________________________________ > Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm > > NB: For a direct answer, e-mail address on the Web site! I'd certainly be willing to run a few benchmarks, but the important thing here is that rather innocent-looking code is running 2-4x slower than it "should". There are things that I think we can really rule out as being "the" factor. 1) Random number generator - I did timings (for both the Ada and FORTRAN) with timing moved to only cover the matrix multiply. 2) Different GCC versions - I built a fresh GCC from the GCC trunk for both Ada and FORTRAN 3) The Machine - I am running both on the same machine, though I suppose there could be differences in 32 bit v.s. 64 bit comparisons.
4) Runtime checks - both the original author (and I) ran with checks suppressed 5) O2/O3 - Actually, I could look at this some more with some other versions, but a quick look when I first started seemed to indicate this was not the issue. A few other thoughts. Once the timing is limited to just the matrix multiply, the stack/heap thing 'should' generally not matter. Some of the changes made to the Ada version make it not really the same program as the FORTRAN version, and the same changes made to the FORTRAN one would also cause it to speed up (e.g. not counting the zeroing of the target array during the accumulation phase). I have certainly seen some amazing performance from some Ada compilers in the past, and in general, on non-trivial benchmarks, I am usually pretty happy with the output of GNAT as well, but in this case it is not great. Further, I tried playing a bit with the new autovectorization capability of the new 4.X series of GCC (it has to be specifically enabled) and found that even very, very trivial cases would refuse to vectorize under Ada (though after I submitted the bug report to GCC, I found that FORTRAN fails to vectorize these too). One thing everyone needs to remember is that this example was (probably) not "Find the way to get the smallest value out of this test program", because there are always ways of doing some tweaks to a small enough region of code to make it better. If there is a 2-4x global slowdown in your 100KLOC program, you will never "get there" following the conventional wisdom of profiling and looking for the problems.
Now, I am not suggesting that GNAT is globally 2-4x slower than GFORTRAN or anything like that (since that does not line up with what I have generally seen on larger code bases), but if I were a manager picking a new language based on a set of long term goals for a project and saw that GNAT was running 2-4x slower, and was still running 1.X to 3X slower after 2 days of Ada gurus looking at it, I'd probably jettison Ada (I know, I am mixing compilers and languages here, but in reality, that is what happens in the real world) and go with something else. And before the chorus of the "processors are so fast that performance does not matter as much as safety and correctness" crowd starts getting too loud, let me point out that there are still many segments of the industry where performance does still indeed matter. Especially when one is trading adding a second processor to an embedded box against a vague promise of "betterness" in terms of safety down the road... OK, off the soapbox. So, in closing, if someone thinks they have "the best" version of that program they want timed against gfortran, post it here and I'll run them. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-21 16:35 ` Jeffrey Creem @ 2006-10-21 17:04 ` Pascal Obry 2006-10-21 21:22 ` Jeffrey Creem 2006-10-21 18:28 ` tmoran 1 sibling, 1 reply; 68+ messages in thread From: Pascal Obry @ 2006-10-21 17:04 UTC (permalink / raw) To: Jeffrey Creem Jeffrey Creem a écrit : > I'd certainly be willing to run a few benchmarks but the important thing > here is that rather innocent looking code is running 2-4x slower than it > "should". The first thing is to be sure that we are running the same program. Running your program with the following changes (as done in Fortran): 1. Using Sum tmp var for the computation for I in A'range (1) loop for J in A'range (2) loop Sum := 0.0; for R in A'range (2) loop Sum := Sum + A (I, R) * B (R, J); end loop; C (I, J) := Sum; end loop; end loop; 2. Using Long_Float instead of Float (I think Fortran float is a Long_Float, to be checked). I went from 7.8s to 4.8s (with 1) and to 4.2s (with 2). Pascal. -- --|------------------------------------------------------ --| Pascal Obry Team-Ada Member --| 45, rue Gabriel Peri - 78114 Magny Les Hameaux FRANCE --|------------------------------------------------------ --| http://www.obry.net --| "The best way to travel is by means of imagination" --| --| gpg --keyserver wwwkeys.pgp.net --recv-key C1082595 ^ permalink raw reply [flat|nested] 68+ messages in thread
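The effect of Pascal's first change can be sketched outside Ada as well. The following C fragment is an illustrative translation (not code from the thread; the function names and the small size N are invented for the example): with the local accumulator, the compiler can keep the running sum in a register instead of re-reading and re-writing the destination element on every inner-loop iteration.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

#define N 4  /* toy size for illustration; the thread benchmarks 800x800 */

/* Inner loop updates c[i][j] through memory on every iteration. */
static void matmul_in_place(float a[N][N], float b[N][N], float c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            c[i][j] = 0.0f;
            for (size_t r = 0; r < N; r++)
                c[i][j] = c[i][j] + a[i][r] * b[r][j]; /* load + store each step */
        }
}

/* Same computation with the "Sum tmp var": one store per element. */
static void matmul_with_sum(float a[N][N], float b[N][N], float c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            float sum = 0.0f;                 /* accumulator, register-friendly */
            for (size_t r = 0; r < N; r++)
                sum = sum + a[i][r] * b[r][j];
            c[i][j] = sum;
        }
}
```

Both forms compute the same products in the same order, so they agree numerically; the difference is purely in how much memory traffic the optimizer has to eliminate, which matches what the timings in this subthread suggest.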
* Re: GNAT compiler switches and optimization 2006-10-21 17:04 ` Pascal Obry @ 2006-10-21 21:22 ` Jeffrey Creem 2006-10-22 3:03 ` Jeffrey Creem 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-21 21:22 UTC (permalink / raw) Pascal Obry wrote: > Jeffrey Creem a écrit : > > >>I'd certainly be willing to run a few benchmarks but the important thing >>here is that rather innocent looking code is running 2-4x slower than it >> "should". > > > The first thing is to be sure that we are running the same program. > > Running your program with the following changes (as done in Fortran): > > 1. Using Sum tmp var for the computation > > for I in A'range (1) loop > for J in A'range (2) loop > Sum := 0.0; > for R in A'range (2) loop > Sum := Sum + A (I, R) * B (R, J); > end loop; > C (I, J) := Sum; > end loop; > end loop; > > 2. Using Long_Float instead of Float (I think Fortran float is a > Long_Float, to be checked). > > I went from 7.8s to 4.8s (with 1) and to 4.2s (with 2). > > Pascal. > Actually, the original poster's Ada program had the temp var, and all of my comparisons of programs that I have asserted were "the same" used the temporary. As for Long_Float v.s. Short_Float, gfortran is using 32 bit floats (as verified by dumping the tree representation and assembly language). Since we are all getting a bit confused by specific versions and numbers, I thought I'd post a summary and create a way of tracking more global performance issues at the gnuada.sf.net wiki. The direct link to these test results is here: http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-21 21:22 ` Jeffrey Creem @ 2006-10-22 3:03 ` Jeffrey Creem 2006-10-22 7:39 ` Jeffrey R. Carter ` (2 more replies) 0 siblings, 3 replies; 68+ messages in thread From: Jeffrey Creem @ 2006-10-22 3:03 UTC (permalink / raw) Jeffrey Creem wrote: > Pascal Obry wrote: > >> Jeffrey Creem a écrit : >> >> >>> I'd certainly be willing to run a few benchmarks but the important thing >>> here is that rather innocent looking code is running 2-4x slower than it >>> "should". >> >> >> >> The first thing is to be sure that we are running the same program. >> >> Running your program with the following changes (as done in Fortran): >> >> 1. Using Sum tmp var for the computation >> >> for I in A'range (1) loop >> for J in A'range (2) loop >> Sum := 0.0; >> for R in A'range (2) loop >> Sum := Sum + A (I, R) * B (R, J); >> end loop; >> C (I, J) := Sum; >> end loop; >> end loop; >> >> 2. Using Long_Float instead of Float (I think Fortran float is a >> Long_Float, to be checked). >> >> I went from 7.8s to 4.8s (with 1) and to 4.2s (with 2). >> >> Pascal. >> > > Actually, the original poster's Ada program had the temp var and all of > my comparisons of programs that I have asserted were "the same" used the > temporary. > > As for Long_Float v.s. Short_Float, gfortran is using 32 bit floats (as > verified by dumping the tree representation and assembly language). > > Since we are all getting a bit confused by specific versions and numbers > I thought I'd post a summary and create a way of tracking more global > performance issues at the gnuada.sf.net wiki. > > The direct link to these test results are here: > > http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison > > > Actually, as a result of this, I submitted a bug report to the GCC bugzilla list. You can follow progress on it here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29543 Interesting initial feedback is that 1) Not an Ada bug.
2) Is a FORTRAN bug 3) Is a backend limitation of the optimizer. Of course, the FORTRAN one still runs correctly so I don't think most users will care that it is because of a bug :) ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 3:03 ` Jeffrey Creem @ 2006-10-22 7:39 ` Jeffrey R. Carter 2006-10-22 11:48 ` tkrauss ` (2 more replies) 2006-10-22 13:50 ` Alinabi 2006-10-22 15:57 ` Jeffrey Creem 2 siblings, 3 replies; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-22 7:39 UTC (permalink / raw) Jeffrey Creem wrote: > > Actually, as a result of this, I submitted a bug report to the GCC > bugzilla list. You can follow progress on it here: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29543 > > Interesting initial feedback is that > 1) Not an Ada bug. > 2) Is a FORTRAN bug > 3) Is a backend limitation of the optimizer. > > Of course, the FORTRAN one still runs correctly so I don't think most > users will care that it is because of a bug :) Interesting. I've been experimenting with some variations simply out of curiosity and found some things that seem a bit strange. (All results for an argument of 800.) Adding the Sum variable makes an important difference, as others have reported, in my case from 5.82 to 4.38 s. Hoisting the indexing calculation for the result (C) matrix location is a basic optimization, and I would be surprised if it isn't done. The only thing I can think of is that it's a cache issue: that all 3 matrices can't be kept in cache at once. Perhaps compiler writers would be able to make sense of this. Previously, I found no difference between -O2 and -O3. With this change, -O2 is faster. The issue of using 'range compared to using "1 .. N" makes no difference in my version of the program. Something I found really surprising is that putting the multiplication in a procedure makes the program faster, down to 4.03 s. I have no idea why this would be so. Compiled with MinGW GNAT 3.4.2, -O2, -gnatnp -fomit-frame-pointer. Run under Windows XP SP2 on a 3.2 GHz Pentium 4 HT with 1 GB RAM. 
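The cache speculation above is the classic motivation for blocked (tiled) matrix multiplication. The following C fragment is an illustrative sketch of that technique, not code from the thread; the sizes N and BS are invented for the example, and a real BS would be tuned so three BS-by-BS tiles fit in cache together.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

#define N  8   /* toy size; this sketch assumes N is a multiple of BS */
#define BS 4   /* tile size, normally tuned to the cache */

/* Blocked multiply: each innermost pass touches only one tile of each
   matrix, so the working set stays cache-resident even when the full
   matrices do not. */
static void matmul_blocked(float a[N][N], float b[N][N], float c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            c[i][j] = 0.0f;

    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t rr = 0; rr < N; rr += BS)
            for (size_t jj = 0; jj < N; jj += BS)
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t j = jj; j < jj + BS; j++) {
                        float sum = c[i][j];
                        for (size_t r = rr; r < rr + BS; r++)
                            sum += a[i][r] * b[r][j];
                        c[i][j] = sum;
                    }
}

/* Plain triple loop, kept only to check the blocked version against. */
static void matmul_naive(float a[N][N], float b[N][N], float c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            float sum = 0.0f;
            for (size_t r = 0; r < N; r++)
                sum += a[i][r] * b[r][j];
            c[i][j] = sum;
        }
}
```

This is presumably part of what Fortran's matmul intrinsic does internally, which would help explain the 2.5 s figure in the original post.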
Here's the code: with Ada.Numerics.Float_Random; with Ada.Command_Line; use Ada.Command_Line; with Ada.Text_IO; use Ada.Text_IO; with Ada.Calendar; use Ada.Calendar; procedure Tst_Array is package F_IO is new Ada.Text_IO.Float_IO (Float); package D_IO is new Ada.Text_Io.Fixed_Io (Duration); N : constant Positive := Integer'Value (Argument (1) ); type Real_Matrix is array (1 .. N, 1 .. N) of Float; pragma Convention (FORTRAN, Real_Matrix); G : Ada.Numerics.Float_Random.Generator; A,B : Real_Matrix := (others => (others => Ada.Numerics.Float_Random.Random (G) ) ); C : Real_Matrix := (others => (others => 0.0) ); Start, Finish : Ada.Calendar.Time; procedure Multiply is Sum : Float; begin -- Multiply All_Rows : for Row in A'range (1) loop All_Columns : for Column in B'range (2) loop Sum := 0.0; All_Common : for R in A'range (2) loop Sum := Sum + A (Row, R) * B (R, Column); end loop All_Common; C (Row, Column) := Sum; end loop All_Columns; end loop All_Rows; end Multiply; begin Start := Ada.Calendar.Clock; Multiply; Finish := Ada.Calendar.Clock; F_IO.Put (C (1, 1) ); F_IO.Put (C (1, 2) ); New_Line; F_IO.Put (C (2, 1) ); F_IO.Put (C (2, 2) ); New_Line; Put ("Time: "); D_IO.Put (Finish - Start); New_Line; end Tst_Array; Next, since meaningful speed-ups of quick sort on a Pentium 4 HT processor using 2 tasks have been reported, I thought I'd see what effect that had. With 2 tasks, I got a time of 3.70 s. That's not a significant speed up: about 9.1%. Same compilation options and platform. Here's that code: with Ada.Numerics.Float_Random; with Ada.Command_Line; use Ada.Command_Line; with Ada.Text_IO; use Ada.Text_IO; with Ada.Calendar; use Ada.Calendar; procedure Tst_Array is package F_IO is new Ada.Text_IO.Float_IO (Float); package D_IO is new Ada.Text_Io.Fixed_Io (Duration); N : constant Positive := Integer'Value (Argument (1) ); type Real_Matrix is array (1 .. N) of Float; pragma Convention (FORTRAN, Real_Matrix); G : Ada.Numerics.Float_Random.Generator; A, B : Real_Matrix := (others => (others => Ada.Numerics.Float_Random.Random (G) ) ); C : Real_Matrix := (others => (others => 0.0) ); Start, Finish : Ada.Calendar.Time; procedure Multiply is procedure Multiply (Start_Row : in Positive; Stop_Row : in Positive) is Sum : Float; begin -- Multiply All_Rows : for Row in Start_Row .. Stop_Row loop All_Columns : for Column in B'range (2) loop Sum := 0.0; All_Common : for R in A'range (2) loop Sum := Sum + A (Row, R) * B (R, Column); end loop All_Common; C (Row, Column) := Sum; end loop All_Columns; end loop All_Rows; end Multiply; task type Multiplier (Start_Row : Positive; Stop_Row : Positive); task body Multiplier is -- null; begin -- Multiplier Multiply (Start_Row => Start_Row, Stop_Row => Stop_Row); end Multiplier; Stop : constant Positive := N / 2; Start : constant Positive := Stop + 1; Mult : Multiplier (Start_Row => 1, Stop_Row => Stop); begin -- Multiply Multiply (Start_Row => Start, Stop_Row => N); end Multiply; begin Start := Ada.Calendar.Clock; Multiply; Finish := Ada.Calendar.Clock; F_IO.Put (C (1, 1) ); F_IO.Put (C (1, 2) ); New_Line; F_IO.Put (C (2, 1) ); F_IO.Put (C (2, 2) ); New_Line; Put ("Time: "); D_IO.Put (Finish - Start); New_Line; end Tst_Array; If I inline the inner Multiply, or put equivalent code in the task and the outer Multiply, the time is much more than for the sequential version, presumably due to cache effects. Since it appears you have 2 physical processors ("Dual Xeon 2.8 Ghz"), I would be interested in seeing what effect this concurrent version has on that platform. I also wonder how easy such a version would be to create in FORTRAN. -- Jeff Carter "Ada has made you lazy and careless. You can write programs in C that are just as safe by the simple application of super-human diligence." E. Robert Tisdale 72 ^ permalink raw reply [flat|nested] 68+ messages in thread
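For comparison, the row-splitting idea in the two-task Ada version above can be sketched with POSIX threads. This is an illustrative C translation, not code from the thread; names and the toy size are invented. As in the Ada version, the two workers write disjoint bands of C, so no synchronization beyond the final join is needed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N 8  /* toy size for illustration */

static float a[N][N], b[N][N], c[N][N];

struct row_span { size_t lo, hi; };  /* half-open row range [lo, hi) */

/* Worker: compute one band of rows of C. Bands are disjoint, so the
   threads never touch the same element and need no locking. */
static void *multiply_rows(void *arg)
{
    const struct row_span *s = arg;
    for (size_t i = s->lo; i < s->hi; i++)
        for (size_t j = 0; j < N; j++) {
            float sum = 0.0f;
            for (size_t r = 0; r < N; r++)
                sum += a[i][r] * b[r][j];
            c[i][j] = sum;
        }
    return NULL;
}

/* One extra thread takes the top half while the calling thread does the
   bottom half, mirroring the single Multiplier task plus the direct call
   to Multiply in the Ada version. */
static void parallel_multiply(void)
{
    struct row_span top = { 0, N / 2 }, bottom = { N / 2, N };
    pthread_t worker;
    pthread_create(&worker, NULL, multiply_rows, &top);
    multiply_rows(&bottom);
    pthread_join(worker, NULL);
}
```

The modest ~9% gain reported above is consistent with hyper-threading sharing one physical core's execution units; on two physical processors the same split should do considerably better.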
* Re: GNAT compiler switches and optimization 2006-10-22 7:39 ` Jeffrey R. Carter @ 2006-10-22 11:48 ` tkrauss 2006-10-22 18:02 ` Georg Bauhaus 2006-10-22 20:20 ` Jeffrey R. Carter 2006-10-22 12:31 ` Gautier 2006-10-22 18:01 ` tmoran 2 siblings, 2 replies; 68+ messages in thread From: tkrauss @ 2006-10-22 11:48 UTC (permalink / raw) > If I inline the inner Multiply, or put equivalent code in the task and > the outer Multiply, the time is much more than for the sequential > version, presumably due to cache effects. > > Since it appears you have 2 physical processors ("Dual Xeon 2.8 Ghz"), I > would be interested in seeing what effect this concurrent version has on > that platform. I also wonder how easy such a version would be to create > in FORTRAN. It's funny that you mention tasking since that's what got me started on this in the first place. I was introduced to OpenMP as a simple method of parallelizing code in fortran. Unfortunately it seems to be "tricky" to really get things in parallel (what data is shared, multiple tasks doing the same thing with the same memory, access violations, etc.) I remembered that Ada had tasking built in so I started playing with that. Now, as you can probably tell from my code, I haven't touched Ada in a _very_ long time, but it was surprisingly easy to set up a simple two-task test program. Anyway, using the latest code from Jeffrey Creem it looks like the execution time (on my machine) has been cut in half (9 seconds). The threaded version runs in nearly the same time for smaller problems but dies with a stack overflow for larger. I see a comment in the Bugzilla recommending a similar construct type Real_Matrix is array (1 .. N, 1 .. N) of Float; That takes the memory from the stack rather than the heap though, no? I assume there is a compiler switch to increase the stack size so the code wouldn't die, but is that the "normal" way of allocating memory?
I'm trying to not look like _too_ much of an Ada neophyte :) ^ permalink raw reply [flat|nested] 68+ messages in thread
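The stack-versus-heap distinction being asked about here is not Ada-specific. The following C fragment is an illustrative sketch (invented names, not code from the thread): a large local array lives on the thread's limited stack and can overflow it silently, while an explicit heap allocation is bounded only by available memory and can report failure.

```c
#include <assert.h>
#include <stdlib.h>
#include <stddef.h>

/* Heap version: one contiguous block, indexed as m[i * n + j].
   Returns NULL on failure instead of crashing. */
static float *make_matrix(size_t n)
{
    return malloc(n * n * sizeof(float));
}

/* The stack analogue would be a variable-length local array:
       float local[n][n];
   For n = 800 that is ~2.5 MB per matrix, which can already exceed a
   small thread stack -- the same failure mode as the Ada version that
   declares Real_Matrix objects directly inside a task. */
```

The usual choices are exactly the ones raised in the thread: raise the (OS- or binder-controlled) stack limit, or allocate large objects from the heap and free them explicitly.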
* Re: GNAT compiler switches and optimization 2006-10-22 11:48 ` tkrauss @ 2006-10-22 18:02 ` Georg Bauhaus 2006-10-22 18:24 ` Jeffrey Creem 2006-10-22 20:20 ` Jeffrey R. Carter 1 sibling, 1 reply; 68+ messages in thread From: Georg Bauhaus @ 2006-10-22 18:02 UTC (permalink / raw) tkrauss wrote: > That takes the memory from the stack rather than the heap though, no? > I assume there is a compiler switch to increase the stack size so the > code wouldn't die, but is that the "normal" way of allocating > memory? I'm trying to not look like _too_ much of an Ada neophyte :) > I just tested a similar thing using two different compilers: a function that uses a local array of a size determined by a function parameter. The result was that - GNAT may trigger constraint errors when the array size exceeds some limit. The documentation tells the compiler user that this is an operating system issue and should be addressed at this level IIUC. (Does this comment apply in this case?) - ObjectAda had no problem with local arrays of the same tens of millions of components. I've measured performance by calling the function many times with different Size parameters for the array. This has consistently shown that on this platform, GNAT code takes at least twice as much time as OA code. (The time is not affected by -O?.) Using an allocator "new Big_Array(1 .. Size)" and an instance of Unchecked_Deallocation in place of simple variable allocation has two effects. - The execution time of ObjectAda code almost does not change. This lets me suspect that OA isn't necessarily using the primary stack for local variables. Indeed, the assembly listing shows calls on rts_ss_allocate (for the local array variable) versus rts_allocate (for an array allocator). The RTS function rts_ss_allocate is the link_name of the function System.RTS.TGT.Sec_Stack_Pkg.Allocate from the ObjectAda RTS. - The "GNAT heap version" performs about 2-3 times faster than the "GNAT stack version".
(GNAT code that uses "new" instead of a simple local variable performs marginally to slightly faster than the OA code using "new".) So there is a major difference in execution speed of programs that compute larger matrices if you use GNAT and simple array variables, but not allocators. Execution speed does not differ significantly using at least one other compiler (ObjectAda). I think that's a pity when seen from a "just Ada" perspective. There is one more thing (allocators) you have to think about if you want speed for local array variables when using GNAT. Windows x86, GNAT GPL 2006, and OA 8.2 U15. The first, middle and last element of the array are referenced. Pragma volatile has been tried, too. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 18:02 ` Georg Bauhaus @ 2006-10-22 18:24 ` Jeffrey Creem 2006-10-23 0:10 ` Georg Bauhaus 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-22 18:24 UTC (permalink / raw) Georg Bauhaus wrote: > > I think that's a pity when seen from a "just Ada" perspective. There > is one more thing (allocators) you have to think about if you > want speed for local array variables when using GNAT. > > > Windows x86, GNAT GPL 2006, and OA 8.2 U15. The first, middle and last > element of the array are referenced. Pragma volatile > has been tried, too. It sounds like you are running some different code, and I'd be hesitant to make any assertions about runtime for certain constructs without seeing it since 1) We just spent 3 days talking about the code that started this thread and there have been all sorts of assertions about things being faster/slower, etc. 2) You mention something about just accessing first, middle and last of your arrays, so it really sounds like you are just timing allocations and not actually hitting the array indexing (though hard to say without seeing the code). Don't take either of those as any sort of severe criticism, but I am just trying to make sure that any discussions of performance are tied to clear, open and repeatable measurements so we all know what each other are talking about. The problem is that too often, performance criticism is done via vague references rather than hard facts, and thus they get dismissed (sometimes) by some compiler writers. The bug submitted on this issue (whether or not it ever gets fixed) certainly has generated real and useful discussions among the developers with few of the typical 'silly user' responses that we sometimes see. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 18:24 ` Jeffrey Creem @ 2006-10-23 0:10 ` Georg Bauhaus 0 siblings, 0 replies; 68+ messages in thread From: Georg Bauhaus @ 2006-10-23 0:10 UTC (permalink / raw) On Sun, 2006-10-22 at 14:24 -0400, Jeffrey Creem wrote: > It sounds like you are running some different code and I'd be hesitant > to make any assertions about runtime for certain constructs without > seeing it since It is appended below. > 2) You mention something about just accessing first, middle and last of > your arrays so it really sounds like you are just timing > allocations and not actually hitting the array indexing > (though hard to say without seeing the code). Yes, I had concentrated on just this question, asked by Thomas Krauss, about stack versus heap allocation. If a program is not frequently allocating and freeing arrays (or matrices, etc.), then the effects might be less of an issue, if they are an issue at all when there is no initialization. But I imagined allocation is just what is happening all the time if you have a function that computes matrices of varying sizes? OTOH, I find very different results on GNU/Linux (for which I only have a number of GNATs). Stack allocation appears to be consuming next to no time. (Commit on write? Just remembering an offset?) Below the following program, there is a patch that adds trivial initialization loops. With them, a comparison of stack vs heap on GNU/Linux seems in favor of the stack. I can't check this on Windows right now, though.
with Ada.Calendar; with Ada.Text_IO; with Ada.Unchecked_Deallocation; procedure main is use Ada.Calendar, Ada; type LIST is array (NATURAL range <>) of BOOLEAN; for LIST'component_size use 8; -- for a "normal" array, not packed or other magic type LIST_PTR is access LIST; procedure free is new Ada.Unchecked_Deallocation (LIST, LIST_PTR); accu: BOOLEAN := false; -- The allocating functions read and write this variable -- using components of the local arrays. -- (This should prevent some optimizations.) function allocate(size: POSITIVE) return BOOLEAN; -- use a local `LIST` of length `size` function allocate_heap(size: POSITIVE) return BOOLEAN; -- use a pointer to a new `LIST` of length `size` function image(t: TIME) return STRING; -- the current time as a `STRING` value function allocate(size: POSITIVE) return BOOLEAN is done: LIST(1 .. size); -- pragma volatile(done); result: BOOLEAN; begin if done(size / 2) then result := false; end if; done(done'last) := accu and done(done'first); result := done(done'last) and done(done'first); return result; end allocate; function allocate_heap(size: POSITIVE) return BOOLEAN is done: LIST_PTR; result: BOOLEAN; begin done := new LIST(1 .. size); if done(size / 2) then result := false; end if; done(done'last) := accu and done(done'first); result := done(done'first) and done(done'first); Free(done); return result; end allocate_heap; function image(t: TIME) return STRING is year: YEAR_NUMBER; day: DAY_NUMBER; month: MONTH_NUMBER; sec: DURATION; begin split(t, year, month, day, sec); return YEAR_NUMBER'image(year) & MONTH_NUMBER'image(month) & DAY_NUMBER'image(day) & DURATION'image(sec); end image; start, finish: TIME; begin Text_IO.put_line("using a stack"); start := clock; for run in 1 .. 25 loop for k in 1 .. 
2000 loop accu := allocate(5000 * k); end loop; end loop; finish := clock; Text_IO.put_line("from" & image(start) & " to" & image(finish) & " = " & DURATION'image(finish - start)); Text_IO.put_line("accu " & BOOLEAN'image(accu)); Text_IO.put_line("using a heap"); start := clock; for run in 1 .. 25 loop for k in 1 .. 2000 loop accu := allocate_heap(5000 * k); end loop; end loop; finish := clock; Text_IO.put_line("from" & image(start) & " to" & image(finish) & " = " & DURATION'image(finish - start)); Text_IO.put_line("accu " & BOOLEAN'image(accu)); end main; --- stack_use_testing.ada 2006/10/22 23:51:48 1.1 +++ stack_use_testing.ada 2006/10/22 23:52:12 @@ -33,2 +33,3 @@ begin + for k in done'range loop done(k) := false; end loop; if done(size / 2) then @@ -46,2 +47,3 @@ done := new LIST(1 .. size); + for k in done'range loop done(k) := false; end loop; if done(size / 2) then @@ -75,4 +77,4 @@ start := clock; - for run in 1 .. 25 loop - for k in 1 .. 2000 loop + for run in 1 .. 2 loop + for k in 1 .. 200 loop accu := allocate(5000 * k); @@ -90,4 +92,4 @@ start := clock; - for run in 1 .. 25 loop - for k in 1 .. 2000 loop + for run in 1 .. 2 loop + for k in 1 .. 200 loop accu := allocate_heap(5000 * k); ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 11:48 ` tkrauss 2006-10-22 18:02 ` Georg Bauhaus @ 2006-10-22 20:20 ` Jeffrey R. Carter 1 sibling, 0 replies; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-22 20:20 UTC (permalink / raw) tkrauss wrote: > > Anyway, using the latest code from Jeffrey Creem it looks like the > execution time (on my machine) has been cut in half (9 seconds) . The > threaded version runs in nearly the same time for smaller problems but > dies with a stack overflow for larger. I see a comment in the Bugzilla > recommending a similar construct > type Real_Matrix is array (1 .. N, 1 .. N) of > Float; > That takes the memory from the stack rather than the heap though, no? > I assume there is a compiler switch to increase the stack size so the > code wouldn't die, but is that the "normal" way of allocating > memory? I'm trying to not look like _too_ much of an Ada neophyte :) With GNAT, at least, they're allocated on the stack. AFAICT, the language doesn't require that, but it's a common approach. The main stack size is probably an OS thing. -- Jeff Carter "I unclog my nose towards you." Monty Python & the Holy Grail 11 ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 7:39 ` Jeffrey R. Carter 2006-10-22 11:48 ` tkrauss @ 2006-10-22 12:31 ` Gautier 2006-10-22 20:26 ` Jeffrey R. Carter 2006-10-22 18:01 ` tmoran 2 siblings, 1 reply; 68+ messages in thread From: Gautier @ 2006-10-22 12:31 UTC (permalink / raw) Jeffrey R. Carter: > Adding the Sum variable makes an important difference, as others have > reported, in my case from 5.82 to 4.38 s. Hoisting the indexing > calculation for the result (C) matrix location is a basic optimization, > and I would be surprised if it isn't done. The only thing I can think of > is that it's a cache issue: that all 3 matrices can't be kept in cache > at once. Perhaps compiler writers would be able to make sense of this. The Sum variable was *removed* by someone at some point of the discussion in order to challenge a bit more the Ada compiler's optimizer. If you replace a good algorithm by a bad one, don't be surprised that the program is slow. At some point of bad coding the best code optimizer won't be able to help you. Eventually the optimizer will transform this: for R in A'range (2) loop C (I, J) := C (I, J) + A (I, R) * B (R, J); end loop; into something like: Sum:= C (I, J); -- hopefully a Sum is mapped to a register for R in A'range (2) loop Sum := Sum + A (I, R) * B (R, J); end loop; C (I, J):= Sum; but in that case it probably won't be able to guess that the C(I,J) was zeroed before and replace the first line by: Sum:= 0.0; sparing the reading of C(I,J) (must cost much time...). If you are luckier, the optimizer will do Sum:= 0.0; for R in A'range (2) loop Sum := Sum + A (I, R) * B (R, J); end loop; C (I, J):= C (I, J) + Sum; But still, it won't spare the time lost to fill the C matrix with zeros. If you want to do a benchmark with Fortran, it's really not a good idea to begin with "pessimizing" the Ada code. Cheers Gautier ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 12:31 ` Gautier @ 2006-10-22 20:26 ` Jeffrey R. Carter 2006-10-22 21:22 ` Gautier 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-22 20:26 UTC (permalink / raw) Gautier wrote: > > The Sum variable was *removed* by someone at some point of the > discussion in order to challenge a bit more the Ada compiler's optimizer. > If you replace a good algorithm by a bad one, don't be surprised that > the program is slow. At some point of bad coding the best code optimizer > won't be able to help you. I did that in my 1st version. I wanted to see if the optimizer would result in equivalent code. No such luck. > Eventually the optimizer will transform this: > for R in A'range (2) loop > C (I, J) := C (I, J) + A (I, R) * B (R, J); > end loop; > into something like: > Sum:= C (I, J); -- hopefully a Sum is mapped to a register > for R in A'range (2) loop > Sum := Sum + A (I, R) * B (R, J); > end loop; > C (I, J):= Sum; > but in that case it probably won't be able to guess that the C(I,J) was > zeroed before and replace the first line by: > Sum:= 0.0; Thanks for the discussion. The initialization of C is static, so a good optimizer could. They're hard to find, though. > sparing the reading of C(I,J) (must cost much time...). > If you are luckier, the optimizer will do > Sum:= 0.0; > for R in A'range (2) loop > Sum := Sum + A (I, R) * B (R, J); > end loop; > C (I, J):= C (I, J) + Sum; > But still, it won't spare the time lost to fill the C matrix with zeros. I didn't include that in the timing. > If you want to do a benchmark with Fortran, it's really not a good idea > to begin with "pessimizing" the Ada code. I'm more interested in seeing what makes a difference in the Ada. In this case, the high-level features that let you write less code. -- Jeff Carter "I unclog my nose towards you." Monty Python & the Holy Grail 11 ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 20:26 ` Jeffrey R. Carter @ 2006-10-22 21:22 ` Gautier 0 siblings, 0 replies; 68+ messages in thread From: Gautier @ 2006-10-22 21:22 UTC (permalink / raw) Jeffrey R. Carter: >> But still, it won't spare the time lost to fill the C matrix with zeros. > > I didn't include that in the timing. But you have to! Otherwise you are cheating. The zeroing is a part of your changed algorithm; without zeroing the result is wrong. >> If you want to do a benchmark with Fortran, it's really not a good >> idea to begin with "pessimizing" the Ada code. > > I'm more interested in seeing what makes a difference in the Ada. In > this case, the high-level features that let you write less code. It is less code... "on the paper". In fact, it is more work, or, at best, if the code optimizer is really smart enough to combine the ":= (others => (others => 0.0) );" with the loops, it is the same work. Maybe that's the problem with high-level features: one can zero a matrix in so few characters that it seems to be done instantly ;-). Cheers Gautier ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 7:39 ` Jeffrey R. Carter 2006-10-22 11:48 ` tkrauss 2006-10-22 12:31 ` Gautier @ 2006-10-22 18:01 ` tmoran 2006-10-22 20:54 ` Jeffrey R. Carter 2 siblings, 1 reply; 68+ messages in thread From: tmoran @ 2006-10-22 18:01 UTC (permalink / raw) > ... the time is much more than for the sequential > version, presumably due to cache effects. Each 800x800 Float matrix is about 2.5 megabytes so I would expect two threads to be fighting each other over the cache. What are the single vs dual threaded times for, say, 250x250 arrays and a 1 MB cache on your machine, and on the dual Xeon machine? Curiously, on a 3 GHz Pentium 4 with 1 MB cache and compiling with -O2 -gnatp using the venerable gnat 3.15p, I get Time: 8.511855190 800 1 thread Time: 5.682686332 800 2 threads for a speedup of 33% Time: 0.092724434 200 1 thread Time: 0.059209520 200 2 threads for a speedup of 36% ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 18:01 ` tmoran @ 2006-10-22 20:54 ` Jeffrey R. Carter 0 siblings, 0 replies; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-22 20:54 UTC (permalink / raw) tmoran@acm.org wrote: > Each 800x800 Float matrix is about 2.5 megabytes so I would expect two > threads to be fighting each other over the cache. What are the single vs > dual threaded times for, say, 250x250 arrays and a 1 MB cache on your > machine, and on the dual Xeon machine? > Curiously, on a 3 GHz Pentium 4 with 1 MB cache and compiling > with -O2 -gnatp using the venerable gnat 3.15p, I get > Time: 8.511855190 800 1 thread > Time: 5.682686332 800 2 threads > for a speedup of 33% > Time: 0.092724434 200 1 thread > Time: 0.059209520 200 2 threads > for a speedup of 36% On my 3.2 GHz P4 HT (best of 3 runs), I get Time: 0.170563050 250 1 thread Time: 0.134637140 250 2 threads for a speedup of 24%. Time: 0.113739732 200 1 thread Time: 0.070599932 200 2 threads for a speedup of 36%. I'm still only getting about 10% speedup for 800. Is it possible my cache is smaller? -- Jeff Carter "I unclog my nose towards you." Monty Python & the Holy Grail 11 ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 3:03 ` Jeffrey Creem 2006-10-22 7:39 ` Jeffrey R. Carter @ 2006-10-22 13:50 ` Alinabi 2006-10-22 15:41 ` Jeffrey Creem 2006-10-22 15:57 ` Jeffrey Creem 2 siblings, 1 reply; 68+ messages in thread From: Alinabi @ 2006-10-22 13:50 UTC (permalink / raw) I ran your test programs compiled with gcc 4.0.3 and the following optimizations: COMMON_FLAGS=-g -march=opteron -mtune=opteron -mfpmath=sse -fomit-frame-pointer -O2 -fdump-tree-optimized and I cannot reproduce the large differences in performance everyone else talks about. Here are the times I get: N Ada Fortran ==================== 64 0.002029 0.000000 128 0.016321 0.016000 256 0.214143 0.204013 512 3.125888 3.124195 800 6.374982 5.864366 1024 34.10479 35.22620 2048 277.3071 283.2417 Jeffrey Creem wrote: > Jeffrey Creem wrote: > > Pascal Obry wrote: > > > >> Jeffrey Creem a écrit : > >> > >> > >>> I'd certainly be willing to run a few benchmarks but the important thing > >>> here is that rather innocent looking code is running 2-4x slower than it > >>> "should". > >> > >> > >> > >> The first thing is to be sure that we are running the same program. > >> > >> Running your program with the following changes (as done in Fortran): > >> > >> 1. Using Sum tmp var for the computation > >> > >> for I in A'range (1) loop > >> for J in A'range (2) loop > >> Sum := 0.0; > >> for R in A'range (2) loop > >> Sum := Sum + A (I, R) * B (R, J); > >> end loop; > >> C (I, J) := Sum; > >> end loop; > >> end loop; > >> > >> 2. Using Long_Float instead of Float (I think Fortran float is a > >> Long_Float, to be checked). > >> > >> I went from 7.8s to 4.8s (with 1) and to 4.2s (with 2). > >> > >> Pascal. > >> > > > > Actually, the original poster's Ada program had the temp var and all of > > my comparisons of programs that I have asserted were "the same" used the > > temporary. > > > > As for Long_Float v.s. 
Short_Float, gfortran is using 32 bit floats (as > > verified by dumping the tree representation and assembly language). > > > > Since we are all getting a bit confused by specific versions and numbers > > I thought I'd post a summary and create a way of tracking more global > > performance issues at the gnuada.sf.net wiki. > > > > The direct link to these test results are here: > > > > http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison > > > > > > > > > > Actually, as a result of this, I submitted a bug report to the GCC > bugzilla list. You can follow progress on it here: > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29543 > > Interesting initial feedback is that > 1) Not an Ada bug. > 2) Is a FORTRAN bug > 3) Is a backend limitation of the optimizer. > > Of course, the FORTRAN one still runs correctly so I don't think most > users will care that it is because of a bug :) ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 13:50 ` Alinabi @ 2006-10-22 15:41 ` Jeffrey Creem 2006-10-23 0:02 ` Alinabi 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-22 15:41 UTC (permalink / raw) Alinabi wrote: > I ran your test programs compiled with gcc 4.0.3 and the following > optimizations: > COMMON_FLAGS=-g -march=opteron -mtune=opteron -mfpmath=sse > -fomit-frame-pointer -O2 -fdump-tree-optimized > and I cannot reproduce the large differences in performance everyone > else talks about. Here are the times I get: > > N Ada Fortran > ==================== > 64 0.002029 0.000000 > 128 0.016321 0.016000 > 256 0.214143 0.204013 > 512 3.125888 3.124195 > 800 6.374982 5.864366 > 1024 34.10479 35.22620 > 2048 277.3071 283.2417 > That is interesting. The question then becomes: has FORTRAN improved on the way to 4.2.0, or has Ada regressed? Try doing a make dis_all which should produce annotated assembly output. The Ada version can be a little daunting in the way we have set up the files, since the generic instantiations at the top fill the .S files (woops, looked like I named them .dis) with the generic instance. Even if you are not an assembly guru, if you start from the bottom of the files you can usually pretty quickly find that inner loop and compare the number of statements. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 15:41 ` Jeffrey Creem @ 2006-10-23 0:02 ` Alinabi 2006-10-23 5:28 ` Gautier 0 siblings, 1 reply; 68+ messages in thread From: Alinabi @ 2006-10-23 0:02 UTC (permalink / raw) Ok, it appears that the score is Fortran: 13 instructions -- Ada: 17 instructions in the inner loop. Aggain, this is compiled with gcc 4.0.3 with the following switches: -g -march=opteron -mtune=opteron -mfpmath=sse -fomit-frame-pointer -O2 -fdump-tree-optimized Fortran inner loop (13 instructions): ============= 3d4: 48 63 c1 movslq %ecx,%rax 3d7: ff c1 inc %ecx 3d9: 48 89 c2 mov %rax,%rdx 3dc: 49 0f af c2 imul %r10,%rax 3e0: 48 0f af d3 imul %rbx,%rdx 3e4: 4c 01 c8 add %r9,%rax 3e7: 48 01 f0 add %rsi,%rax 3ea: 48 8d 14 17 lea (%rdi,%rdx,1),%rdx 3ee: 44 39 c1 cmp %r8d,%ecx 3f1: f3 41 0f 10 04 83 movss (%r11,%rax,4),%xmm0 3f7: f3 41 0f 59 04 94 mulss (%r12,%rdx,4),%xmm0 3fd: f3 0f 58 c8 addss %xmm0,%xmm1 401: 75 d1 jne 3d4 <MAIN__+0x3d4> Ada inner loop (17 instructions): ============== a90: ff c6 inc %esi a92: 48 63 d6 movslq %esi,%rdx a95: 48 89 d0 mov %rdx,%rax a98: 48 29 e8 sub %rbp,%rax a9b: 4c 01 d8 add %r11,%rax a9e: f3 41 0f 10 0c 84 movss (%r12,%rax,4),%xmm1 aa4: 4c 89 c0 mov %r8,%rax aa7: 4c 29 d2 sub %r10,%rdx aaa: 45 31 c9 xor %r9d,%r9d aad: 48 83 c0 04 add $0x4,%rax ab1: 49 0f 48 c1 cmovs %r9,%rax ab5: 48 0f af d0 imul %rax,%rdx ab9: f3 0f 10 04 17 movss (%rdi,%rdx,1),%xmm0 abe: f3 0f 59 c1 mulss %xmm1,%xmm0 ac2: f3 0f 58 d0 addss %xmm0,%xmm2 ac6: 39 de cmp %ebx,%esi ac8: 75 c6 jne a90 <_ada_tst_array+0x3a0> Now, since Jeffrey was right about me not being an assembly guru, here is the assembly code for all three nested loops, so that you can doublecheck yourself. 
Fortran: ======== do I = 1,N 2f2: 8b 84 24 ac 01 00 00 mov 0x1ac(%rsp),%eax 2f9: f3 0f 11 44 24 28 movss %xmm0,0x28(%rsp) 2ff: 85 c0 test %eax,%eax 301: 0f 8e 28 01 00 00 jle 42f <MAIN__+0x42f> do J = 1,N sum = 0.0 do R = 1,N sum = sum + A(I,R)*B(R,J) 307: 48 8b 8c 24 28 01 00 mov 0x128(%rsp),%rcx 30e: 00 30f: 48 8b 94 24 18 01 00 mov 0x118(%rsp),%rdx 316: 00 317: 44 8d 40 01 lea 0x1(%rax),%r8d 31b: 4c 8b a4 24 10 01 00 mov 0x110(%rsp),%r12 322: 00 323: 48 8b 9c 24 40 01 00 mov 0x140(%rsp),%rbx 32a: 00 32b: 4c 8b 9c 24 60 01 00 mov 0x160(%rsp),%r11 332: 00 333: 4c 8b 94 24 78 01 00 mov 0x178(%rsp),%r10 33a: 00 33b: 48 89 4c 24 10 mov %rcx,0x10(%rsp) 340: 48 89 54 24 18 mov %rdx,0x18(%rsp) 345: 48 8b 8c 24 90 01 00 mov 0x190(%rsp),%rcx 34c: 00 end do C(I,J) = sum 34d: 48 8b 94 24 c0 00 00 mov 0xc0(%rsp),%rdx 354: 00 355: 4c 8b 8c 24 68 01 00 mov 0x168(%rsp),%r9 35c: 00 35d: 4c 8b bc 24 f0 00 00 mov 0xf0(%rsp),%r15 364: 00 365: 0f 57 d2 xorps %xmm2,%xmm2 368: c7 44 24 30 01 00 00 movl $0x1,0x30(%rsp) 36f: 00 370: 48 89 4c 24 20 mov %rcx,0x20(%rsp) 375: 48 89 54 24 48 mov %rdx,0x48(%rsp) 37a: 48 8b 8c 24 d8 00 00 mov 0xd8(%rsp),%rcx 381: 00 382: 48 8b 94 24 c8 00 00 mov 0xc8(%rsp),%rdx 389: 00 38a: 48 89 4c 24 40 mov %rcx,0x40(%rsp) 38f: 48 89 54 24 38 mov %rdx,0x38(%rsp) 394: 48 63 44 24 30 movslq 0x30(%rsp),%rax 399: 48 8b 54 24 10 mov 0x10(%rsp),%rdx 39e: 41 bd 01 00 00 00 mov $0x1,%r13d 3a4: 48 8b 4c 24 18 mov 0x18(%rsp),%rcx 3a9: 48 0f af d0 imul %rax,%rdx 3ad: 48 0f af 44 24 40 imul 0x40(%rsp),%rax 3b3: 48 8d 3c 0a lea (%rdx,%rcx,1),%rdi 3b7: 48 8b 54 24 38 mov 0x38(%rsp),%rdx 3bc: 4c 8d 34 10 lea (%rax,%rdx,1),%r14 3c0: 48 8b 74 24 20 mov 0x20(%rsp),%rsi 3c5: 49 63 ed movslq %r13d,%rbp 3c8: b9 01 00 00 00 mov $0x1,%ecx 3cd: 0f 28 ca movaps %xmm2,%xmm1 3d0: 48 0f af f5 imul %rbp,%rsi 3d4: 48 63 c1 movslq %ecx,%rax 3d7: ff c1 inc %ecx 3d9: 48 89 c2 mov %rax,%rdx 3dc: 49 0f af c2 imul %r10,%rax 3e0: 48 0f af d3 imul %rbx,%rdx 3e4: 4c 01 c8 add %r9,%rax 3e7: 48 01 
f0 add %rsi,%rax 3ea: 48 8d 14 17 lea (%rdi,%rdx,1),%rdx 3ee: 44 39 c1 cmp %r8d,%ecx 3f1: f3 41 0f 10 04 83 movss (%r11,%rax,4),%xmm0 3f7: f3 41 0f 59 04 94 mulss (%r12,%rdx,4),%xmm0 3fd: f3 0f 58 c8 addss %xmm0,%xmm1 401: 75 d1 jne 3d4 <MAIN__+0x3d4> 403: 4c 89 f8 mov %r15,%rax 406: 48 8b 54 24 48 mov 0x48(%rsp),%rdx 40b: 41 ff c5 inc %r13d 40e: 48 0f af c5 imul %rbp,%rax 412: 41 39 cd cmp %ecx,%r13d 415: 49 8d 04 06 lea (%r14,%rax,1),%rax 419: f3 0f 11 0c 82 movss %xmm1,(%rdx,%rax,4) 41e: 75 a0 jne 3c0 <MAIN__+0x3c0> 420: ff 44 24 30 incl 0x30(%rsp) 424: 44 39 6c 24 30 cmp %r13d,0x30(%rsp) 429: 0f 85 65 ff ff ff jne 394 <MAIN__+0x394> end do end do Ada: ============= for I in A'range(1) loop 9a5: 41 8b 45 00 mov 0x0(%r13),%eax 9a9: 41 8b 55 04 mov 0x4(%r13),%edx 9ad: 39 d0 cmp %edx,%eax 9af: 0f 8f 8a 01 00 00 jg b3f <_ada_tst_array+0x44f> for J in A'range(2) loop Sum := 0.0; for R in A'range(2) loop Sum := Sum + A(I,R)*B(R,J); 9b5: 4c 8b bc 24 d0 00 00 mov 0xd0(%rsp),%r15 9bc: 00 end loop; C(I,J) := Sum; 9bd: 48 8b 8c 24 c0 00 00 mov 0xc0(%rsp),%rcx 9c4: 00 9c5: 4c 8b a4 24 e0 00 00 mov 0xe0(%rsp),%r12 9cc: 00 9cd: 89 44 24 68 mov %eax,0x68(%rsp) 9d1: 4c 89 7c 24 50 mov %r15,0x50(%rsp) 9d6: 48 89 8c 24 88 00 00 mov %rcx,0x88(%rsp) 9dd: 00 9de: 45 8b 75 08 mov 0x8(%r13),%r14d 9e2: 45 8b 6d 0c mov 0xc(%r13),%r13d 9e6: 44 89 6c 24 6c mov %r13d,0x6c(%rsp) 9eb: 4c 63 7c 24 6c movslq 0x6c(%rsp),%r15 9f0: 48 63 f8 movslq %eax,%rdi 9f3: 0f 57 e4 xorps %xmm4,%xmm4 9f6: 48 89 7c 24 20 mov %rdi,0x20(%rsp) 9fb: 49 63 ee movslq %r14d,%rbp 9fe: 89 54 24 2c mov %edx,0x2c(%rsp) a02: 44 89 eb mov %r13d,%ebx a05: 4c 89 7c 24 30 mov %r15,0x30(%rsp) a0a: 44 3b 74 24 6c cmp 0x6c(%rsp),%r14d a0f: 0f 8f 17 01 00 00 jg b2c <_ada_tst_array+0x43c> a15: 48 8b 44 24 30 mov 0x30(%rsp),%rax a1a: ba 00 00 00 00 mov $0x0,%edx a1f: 45 89 f5 mov %r14d,%r13d a22: 0f 28 dc movaps %xmm4,%xmm3 a25: 48 8b 4c 24 40 mov 0x40(%rsp),%rcx a2a: 48 29 e8 sub %rbp,%rax a2d: 48 8d 04 85 04 00 00 lea 
0x4(,%rax,4),%rax a34: 00 a35: 48 85 c0 test %rax,%rax a38: 48 0f 48 c2 cmovs %rdx,%rax a3c: 48 63 54 24 68 movslq 0x68(%rsp),%rdx a41: 48 c1 f8 02 sar $0x2,%rax a45: 48 89 54 24 70 mov %rdx,0x70(%rsp) a4a: 48 2b 54 24 20 sub 0x20(%rsp),%rdx a4f: 49 89 d3 mov %rdx,%r11 a52: 4c 0f af d8 imul %rax,%r11 a56: 8b 01 mov (%rcx),%eax a58: 4c 63 d0 movslq %eax,%r10 a5b: 8b 41 0c mov 0xc(%rcx),%eax a5e: 48 98 cltq a60: 8b 51 08 mov 0x8(%rcx),%edx a63: 48 63 d2 movslq %edx,%rdx a66: 48 29 d0 sub %rdx,%rax a69: 48 89 14 24 mov %rdx,(%rsp) a6d: 4c 8d 04 85 00 00 00 lea 0x0(,%rax,4),%r8 a74: 00 a75: 49 63 cd movslq %r13d,%rcx a78: 4c 8b 7c 24 50 mov 0x50(%rsp),%r15 a7d: 44 89 f6 mov %r14d,%esi a80: 48 89 c8 mov %rcx,%rax a83: 48 2b 04 24 sub (%rsp),%rax a87: 0f 28 d3 movaps %xmm3,%xmm2 a8a: 49 8d 3c 87 lea (%r15,%rax,4),%rdi a8e: eb 02 jmp a92 <_ada_tst_array+0x3a2> a90: ff c6 inc %esi a92: 48 63 d6 movslq %esi,%rdx a95: 48 89 d0 mov %rdx,%rax a98: 48 29 e8 sub %rbp,%rax a9b: 4c 01 d8 add %r11,%rax a9e: f3 41 0f 10 0c 84 movss (%r12,%rax,4),%xmm1 aa4: 4c 89 c0 mov %r8,%rax aa7: 4c 29 d2 sub %r10,%rdx aaa: 45 31 c9 xor %r9d,%r9d aad: 48 83 c0 04 add $0x4,%rax ab1: 49 0f 48 c1 cmovs %r9,%rax ab5: 48 0f af d0 imul %rax,%rdx ab9: f3 0f 10 04 17 movss (%rdi,%rdx,1),%xmm0 abe: f3 0f 59 c1 mulss %xmm1,%xmm0 ac2: f3 0f 58 d0 addss %xmm0,%xmm2 ac6: 39 de cmp %ebx,%esi ac8: 75 c6 jne a90 <_ada_tst_array+0x3a0> aca: 48 8b 54 24 48 mov 0x48(%rsp),%rdx acf: 8b 02 mov (%rdx),%eax ad1: 48 63 d0 movslq %eax,%rdx ad4: 48 8b 7c 24 48 mov 0x48(%rsp),%rdi ad9: 8b 47 0c mov 0xc(%rdi),%eax adc: 48 63 f8 movslq %eax,%rdi adf: 4c 8b 7c 24 48 mov 0x48(%rsp),%r15 ae4: 41 8b 47 08 mov 0x8(%r15),%eax ae8: 48 98 cltq aea: 4c 8b 7c 24 70 mov 0x70(%rsp),%r15 aef: 48 29 c7 sub %rax,%rdi af2: 48 29 c1 sub %rax,%rcx af5: 48 8d 04 bd 04 00 00 lea 0x4(,%rdi,4),%rax afc: 00 afd: 49 29 d7 sub %rdx,%r15 b00: 48 85 c0 test %rax,%rax b03: 4c 89 fa mov %r15,%rdx b06: 49 0f 48 c1 cmovs %r9,%rax b0a: 48 0f af d0 imul 
%rax,%rdx b0e: 48 8b 84 24 88 00 00 mov 0x88(%rsp),%rax b15: 00 b16: 48 8d 0c 88 lea (%rax,%rcx,4),%rcx b1a: f3 0f 11 14 11 movss %xmm2,(%rcx,%rdx,1) b1f: 41 39 f5 cmp %esi,%r13d b22: 74 08 je b2c <_ada_tst_array+0x43c> b24: 41 ff c5 inc %r13d b27: e9 49 ff ff ff jmpq a75 <_ada_tst_array+0x385> b2c: 8b 54 24 2c mov 0x2c(%rsp),%edx b30: 39 54 24 68 cmp %edx,0x68(%rsp) b34: 74 09 je b3f <_ada_tst_array+0x44f> b36: ff 44 24 68 incl 0x68(%rsp) b3a: e9 cb fe ff ff jmpq a0a <_ada_tst_array+0x31a> end loop; end loop; Jeffrey Creem wrote: > Alinabi wrote: > > I ran your test programs compiled with gcc 4.0.3 and the following > > optimizations: > > COMMON_FLAGS=-g -march=opteron -mtune=opteron -mfpmath=sse > > -fomit-frame-pointer -O2 -fdump-tree-optimized > > and I cannot reproduce the large differences in performance everyone > > else talks about. Here are the times I get: > > > > N Ada Fortran > > ==================== > > 64 0.002029 0.000000 > > 128 0.016321 0.016000 > > 256 0.214143 0.204013 > > 512 3.125888 3.124195 > > 800 6.374982 5.864366 > > 1024 34.10479 35.22620 > > 2048 277.3071 283.2417 > > > > That is interesting. The question then becomes has FORTRAN improved on > the way to 4.2.0 or has Ada regressed. > > Try doing a make dis_all which should produce annotated assembly output. > The Ada version can be a little daunting in the way we have setup the > files since the generic instantiations at the top full the .S files > (woops, looked like I named them .dis) with the generic instatance. > > Even if you are not an assembly guru, if you start from the bottom of > the files you can usually pretty quickly find that inner loop and > compare the number of statements. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 0:02 ` Alinabi @ 2006-10-23 5:28 ` Gautier 2006-10-23 16:32 ` Alinabi 0 siblings, 1 reply; 68+ messages in thread From: Gautier @ 2006-10-23 5:28 UTC (permalink / raw) Alinabi: > -g -march=opteron -mtune=opteron -mfpmath=sse -fomit-frame-pointer -O2 > -fdump-tree-optimized Did you give -funroll-loops a try ? G. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 5:28 ` Gautier @ 2006-10-23 16:32 ` Alinabi 0 siblings, 0 replies; 68+ messages in thread From: Alinabi @ 2006-10-23 16:32 UTC (permalink / raw) I did, and it made no difference. I think no loop unrolling is done in the part of the code which is being timed, as the number of iterations is not known at compile time. Gautier wrote: > Alinabi: > > > -g -march=opteron -mtune=opteron -mfpmath=sse -fomit-frame-pointer -O2 > > -fdump-tree-optimized > > Did you give -funroll-loops a try ? > > G. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 3:03 ` Jeffrey Creem 2006-10-22 7:39 ` Jeffrey R. Carter 2006-10-22 13:50 ` Alinabi @ 2006-10-22 15:57 ` Jeffrey Creem 2006-10-22 19:32 ` Damien Carbonne 2 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-22 15:57 UTC (permalink / raw) Followup on the bug report. One of the comments asserted that the two programs were not equivalent, though I am not yet 100% convinced that I believe it. His recommendation was to remove a level of indirection by changing the way the array is declared. N : Positive := Positive'Value (Argument (1)); G : Ada.Numerics.Float_Random.Generator; type Real_Matrix is array (1..N, 1..N) of Float; type Matrix_Access is access Real_Matrix; A,B,C : Matrix_Access; Start, Finish : Ada.Calendar.Time; Sum : Float := 0.0; begin A := new Real_Matrix; B := new Real_Matrix; C := new Real_Matrix; This does indeed substantially improve the performance (still not quite to FORTRAN levels). The reason I question the equivalence is that in the original version, the array could have been passed to a procedure that took in an unconstrained array (which is pretty much what I think the FORTRAN version would allow) while this new version would not do that. A quick check shows the FORTRAN is now only 1.2 times faster (did not do the multiple run thing yet). Perhaps tonight. I'll also attempt to take one of the threaded ones and include that as well. Can someone that understands FORTRAN better make an argument about the "closeness" of this approach vs. the other? ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 15:57 ` Jeffrey Creem @ 2006-10-22 19:32 ` Damien Carbonne 2006-10-22 20:00 ` Gautier ` (3 more replies) 0 siblings, 4 replies; 68+ messages in thread From: Damien Carbonne @ 2006-10-22 19:32 UTC (permalink / raw) Jeffrey Creem a écrit : > Followup on the bug report. > > One of the comments asserted that the two programs were not equivalent > though I am not yet 100% convinced that I believe it yet. > <snip> > > > Can someone that understands FORTRAN better make an argument about the > "closeness" of this approach vs. the other? I've been following this thread for some time and I'm not a Fortran Guru; however, IIRC, Fortran arrays and Ada arrays don't have the same memory layout (something like row-major order versus column-major order). I've not compared the programs of this thread with care, but I have the feeling that this could change things. Some years ago, I had to mix Ada and Fortran, and I remember that we had to invert the loop order to obtain the same result. One would need to add 'pragma Convention (Fortran, Real_Matrix)' on the Ada side to obtain the same result, or to exchange the loops. My knowledge of Fortran was limited to Fortran 77, and I don't know if this is still valid with Fortran 95 or later. But if it is still true, I would not consider the 2 programs as equivalent. Regards Damien Carbonne ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 19:32 ` Damien Carbonne @ 2006-10-22 20:00 ` Gautier 2006-10-22 20:51 ` Damien Carbonne 2006-10-23 1:31 ` Jeffrey Creem ` (2 subsequent siblings) 3 siblings, 1 reply; 68+ messages in thread From: Gautier @ 2006-10-22 20:00 UTC (permalink / raw) Damien Carbonne: > I've been following this thread for some times and I'm not a Fortran > Guru,however, IIRC, Fortran arrays and Ada arrays don't have the same > memory layout (Something like row major order and column major order). > I've not compared programs of this thread with care, but I've the > feeling that this could change things. > Some years ago, I had to mix Ada and Fortran, and I remember that we had > to invert loops order to obtain the same result. > One would need to add 'pragma Convention (Fortran, Real_Matrix)' on Ada > side to obtain the same result, or to exchange loops. > My knowledge of Fortran was limited to Fortran 77, and I don't know if > this is still valid with Fortran 95 or later. > But if is still true, I would not consider the 2 programs as equivalent. The question of Fortran's array storage has been treated earlier in this discussion, you just happened to add more noise by not reading enough posts ;-). Look at the post of Martin Krischik, 20.10.06 18:30 CET... it shows the most serious testing so far, with the exact sources used for testing. Cheers, Gautier ______________________________________________________________ Ada programming -- http://www.mysunrise.ch/users/gdm/gsoft.htm NB: For a direct answer, e-mail address on the Web site! ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 20:00 ` Gautier @ 2006-10-22 20:51 ` Damien Carbonne 2006-10-23 2:15 ` Jeffrey Creem 0 siblings, 1 reply; 68+ messages in thread From: Damien Carbonne @ 2006-10-22 20:51 UTC (permalink / raw) Gautier a écrit : > Damien Carbonne: <snip> > > > The question of Fortran's array storage has been treated earlier in this > discussion, you just happened to add more noise by not reading enough > posts ;-). Look at the post of Martin Krischik, 20.10.06 18:30 CET... it > shows the most serious testing so far, with the exact sources used for > testing. Sorry for this. I'll go back to my silent lazy listening after this message :-) S. Tardieu had also suggested this layout issue in one message and I didn't follow the links given by M. Krischik. Anyway, independently of compiler bugs or limitations, differences in matrix layout lead to different (not strictly equivalent) programs! Which was precisely a question asked by J. Creem in his first sentence! Regards Damien ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 20:51 ` Damien Carbonne @ 2006-10-23 2:15 ` Jeffrey Creem 2006-10-23 2:29 ` Jeffrey R. Carter 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-23 2:15 UTC (permalink / raw) Damien Carbonne wrote: > Gautier a écrit : > >> Damien Carbonne: > > <snip> > >> >> >> The question of Fortran's array storage has been treated earlier in >> this discussion, you just happened to add more noise by not reading >> enough posts ;-). Look at the post of Martin Krischik, 20.10.06 18:30 >> CET... it shows the most serious testing so far, with the exact >> sources used for testing. > > Sorry for this. I'll go back to my silent lazy listening after this > message :-) S. Tardieu had also suggested this layout issue in one > message and I didn't follow the links given by M. Krischik. Anyway, > independently of compiler bugs or limitations, differences in matrix > layout lead to different (not strictly equivalent) programs! Which was > precisely a question asked by J. Creem in his first sentence! > > Regards > Damien Actually, feel free to chime in even if you are not sure you are adding value. Heck, if we all waited until we were adding value, we'd never post anything :) Part of the problem here was my poor wording of the question. What I was asking is whether the two programs were semantically equivalent without dropping down to specific issues of storage representation. Specifically, one of the biggest speedups posted (and suggested by one person on the bugzilla list) changes the Ada program so it is no longer a pointer to an unconstrained array but rather a pointer to a constrained array where the constrained array type definition is determined at run time based on the size of the command line parameter. This may seem like the same thing but where I am asserting this falls apart is if someone wanted to write code that operated on variable sized arrays, this approach would probably not work (i.e.
dynamic creation of a fixed length array). It appears to me that the FORTRAN version is using as dynamic an array size as possible and further that it is possible to pass these dynamically sized arrays to other procedures (for example matmul). Now, matmul may be some sort of special case because it appears to be a compiler intrinsic on most compilers. I've poked around the web resources a bit for FORTRAN 90/95 tutorials but I have not found the specific examples that would convince me I know what I am talking about. I guess what I am asking (or perhaps asserting) is that the original version of the Ada program is the closest to the FORTRAN version and not the one that creates the fixed length array, since it appears that allocatable arrays of varying sizes can be passed to Fortran subroutines. The original version of the Ada maintained that property. The modified version (and its derivatives that set the size in the type rather than in the new) do not appear to me to be equivalent programs. The reason this is important is that in general I don't think we are trying to solve the "write the fastest program that can calculate a matrix multiply" but rather investigate the way in which fairly simple programming idioms are translated into assembly language and understand their runtime implications (at least that is what my goal has been in this). ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 2:15 ` Jeffrey Creem @ 2006-10-23 2:29 ` Jeffrey R. Carter 0 siblings, 0 replies; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-23 2:29 UTC (permalink / raw) Jeffrey Creem wrote: > > Specifically, one of the biggest speedups posted (and suggested by one > person on the bugzilla list) changes the Ada program so it is no longer > a pointer to an unconstrained array but rather a pointer to a > constrained array where the constrained array type definition is > determined at run time based on the size of the command line parameter. > > This may seem like the same thing but where I am asserting this falls > apart is if someone wanted to write code that operated on variable sized > arrays, this approach would probably not work (i.e. dynamic creation of > a fixed length array). You should be able to get the best of both worlds: N : constant Positive := Integer'Value (Argument (1) ); type Base_Matrix is array (Positive range <>, Positive range <>) of Float; subtype Real_Matrix is Base_Matrix (1 .. N, 1 .. N); -- Jeff Carter "I unclog my nose towards you." Monty Python & the Holy Grail 11 ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 19:32 ` Damien Carbonne 2006-10-22 20:00 ` Gautier @ 2006-10-23 1:31 ` Jeffrey Creem 2006-10-23 3:10 ` Jeffrey Creem 2006-10-23 12:33 ` Warner BRUNS 2006-10-23 12:40 ` Warner BRUNS 3 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-23 1:31 UTC (permalink / raw) Damien Carbonne wrote: > Jeffrey Creem a écrit : > >> Followup on the bug report. >> >> One of the comments asserted that the two programs were not equivalent >> though I am not yet 100% convinced that I believe it yet. >> > <snip> > >> >> >> Can someone that understands FORTRAN better make an argument about the >> "closeness" of this approach vs. the other? > > > I've been following this thread for some time and I'm not a Fortran > Guru; however, IIRC, Fortran arrays and Ada arrays don't have the same > memory layout (something like row-major order versus column-major order). > I've not compared the programs of this thread with care, but I have the > feeling that this could change things. > Some years ago, I had to mix Ada and Fortran, and I remember that we had > to invert the loop order to obtain the same result. > One would need to add 'pragma Convention (Fortran, Real_Matrix)' on the Ada > side to obtain the same result, or to exchange the loops. > My knowledge of Fortran was limited to Fortran 77, and I don't know if > this is still valid with Fortran 95 or later. > But if it is still true, I would not consider the 2 programs as equivalent. > > Regards > > Damien Carbonne > Yes, there is always the chance of row major/column major issues creeping in and causing dramatically different cache results; however, the original poster seemed to have tried the swap to no effect, and the assembly code appears to show that this is/was not the problem. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 1:31 ` Jeffrey Creem @ 2006-10-23 3:10 ` Jeffrey Creem 2006-10-23 7:31 ` Jeffrey R. Carter 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-23 3:10 UTC (permalink / raw) Ok, a lot of suggestions have been posted, including a few that I have not addressed or even investigated (specifically, the post that talked about the assembly language output from gcc 4.0.3, which makes this appear (to me) to be a regression). Having said that, I did update the wiki site with some timing numbers people asked for from a dual processor machine. http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison Scroll to the bottom and start reading at "Other Versions" if you are interested. This is getting too messy for plain text conversation. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 3:10 ` Jeffrey Creem @ 2006-10-23 7:31 ` Jeffrey R. Carter 2006-10-23 11:55 ` Jeffrey Creem 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-23 7:31 UTC (permalink / raw) Jeffrey Creem wrote: > > Having said that, I did update the wiki site with some timing numbers > people asked for from a dual processor machine. > > http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison > > Scroll to the bottom and start reading at "Other Versions" if you are > interested. This is getting too messy for plain text conversation. Wow, a greater than 50% speedup from using 2 tasks instead of 1 (from ada/tst_array to ada/tst_array_task_2). Something is odd there. This shows that, for those with dual-processor machines, easy-to-create Ada is faster than easy-to-create FORTRAN. This doesn't help those with single processors, of course. I doubt if anything will beat matmul (short of 640_000 processors). But Ada with explicit loops and FORTRAN with matmul are hardly equivalent. You need to convention-FORTRAN the Ada array type, import matmul, and call it to get a fair comparison. There shouldn't be much difference. -- Jeff Carter "I unclog my nose towards you." Monty Python & the Holy Grail 11 ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 7:31 ` Jeffrey R. Carter @ 2006-10-23 11:55 ` Jeffrey Creem 2006-10-23 19:52 ` Wiljan Derks 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-23 11:55 UTC (permalink / raw) Jeffrey R. Carter wrote: > Jeffrey Creem wrote: > >> >> Having said that, I did update the wiki site with some timing numbers >> people asked for from a dual processor machine. >> >> http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison >> >> >> Scroll to the bottom and start reading at "Other Versions" if you are >> interested. This is getting too messy for plain text conversation. > > > Wow, a greater than 50% speedup from using 2 tasks instead of 1 (from > ada/tst_array to ada/tst_array_task_2). Something is odd there. > > This shows that, for those with dual-processor machines, easy-to-create > Ada is faster than easy-to-create FORTRAN. > > This doesn't help those with single processors, of course. > > I doubt if anything will beat matmul (short of 640_000 processors). But > Ada with explicit loops and FORTRAN with matmul are hardly equivalent. > You need to convention-FORTRAN the Ada array type, import matmul, and > call it to get a fair comparison. There shouldn't be much difference. > I looked at the gcc FORTRAN matmul. Unless there is some additional trickery going on behind the scenes, it is not anything magical. It looks like a matmul implemented in C with manual array subscripting logic (i.e. it uses a single-dimensional array overlay). In any case, it is not so much matmul I am trying to make faster here but rather just the nature of 2D array traversals in native language structures. I just included the FORTRAN matmul to be "more than fair" to FORTRAN, as I am in no way trying to bash FORTRAN. The speedup for the tasks is quite odd though. I'll need to disassemble it tonight.
I also just finished a 4.0.2 install last night so I'll get those numbers to see if all of this mess is simply a regression someplace in the compiler. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 11:55 ` Jeffrey Creem @ 2006-10-23 19:52 ` Wiljan Derks 2006-10-23 20:25 ` Jeffrey R. Carter 2006-10-24 9:52 ` Dr. Adrian Wrigley 0 siblings, 2 replies; 68+ messages in thread From: Wiljan Derks @ 2006-10-23 19:52 UTC (permalink / raw) > Jeffrey R. Carter wrote: > I looked at the gcc FORTRAN matmul. Unless there some additional trickery > going on behind the scenes, it is not anything magical. It looks like a > matmul implemented in C with manual array subscripting logic (i.e. uses a > single dimensional array overlay).. When going for a faster algorithm this might be useful information. Notice that this matmul uses an O(n^3) algorithm and it is possible to do this faster. I am not sure, but maybe O(n^2 log(n)) is reachable, which is much faster. It is easy to see from the loops that the basic multiplications are done n times for the same pairs of numbers. In a similar way, the same additions are done multiple times. That means that there is real potential to improve things. For all small matrices (say 2 up to 4) one can write out all the logic to multiply them and then optimize the code for those specific multiplications. This will give us fast multiplications for small matrices. When having a larger matrix, it can be chopped into 4 smaller ones and these pieces can then be multiplied and added together. In total this can lead to a faster algorithm, although much more complex than the plain multiplication. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 19:52 ` Wiljan Derks @ 2006-10-23 20:25 ` Jeffrey R. Carter 2006-10-24 9:52 ` Dr. Adrian Wrigley 1 sibling, 0 replies; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-23 20:25 UTC (permalink / raw) Wiljan Derks wrote: >> Jeffrey R. Carter wrote: >> I looked at the gcc FORTRAN matmul. Unless there some additional trickery >> going on behind the scenes, it is not anything magical. It looks like a >> matmul implemented in C with manual array subscripting logic (i.e. uses a >> single dimensional array overlay).. For the record, this was Jeffrey Creem, not me. -- Jeff Carter "Blessed are they who convert their neighbors' oxen, for they shall inhibit their girth." Monty Python's Life of Brian 83 ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 19:52 ` Wiljan Derks 2006-10-23 20:25 ` Jeffrey R. Carter @ 2006-10-24 9:52 ` Dr. Adrian Wrigley 2006-10-24 11:50 ` Jeffrey Creem 2006-10-24 19:21 ` Wiljan Derks 1 sibling, 2 replies; 68+ messages in thread From: Dr. Adrian Wrigley @ 2006-10-24 9:52 UTC (permalink / raw) On Mon, 23 Oct 2006 21:52:59 +0200, Wiljan Derks wrote: >> Jeffrey R. Carter wrote: >> I looked at the gcc FORTRAN matmul. Unless there some additional trickery >> going on behind the scenes, it is not anything magical. It looks like a >> matmul implemented in C with manual array subscripting logic (i.e. uses a >> single dimensional array overlay).. > When going for a faster algorithm this might be usefull information. > Notice that this matmul uses an O(n^3) algorithm and it is possible to do > this faster. > I am not sure, but maybe O(n^2 log(n)) is reachable, which is much faster. > It is easy to see that from the loops that the basic multiplications are > done n times for the same pairs of numbers. I know you're quite correct that there is a faster algorithm for large n. But looking at the loops, I don't see where the same number pairs are multiplied at different points in the calculation. Doesn't A(1,1)*B(1,1) only appear in C(1,1)? Where else? > In a similar way, the same additions are done multiple times. > That means that there is real potential to improve things. Aren't they only the same additions if the same products are summed? I may be being particularly stupid today, but if the redundancy is obvious from the loops, can you help me see where! Thanks. -- Adrian ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-24 9:52 ` Dr. Adrian Wrigley @ 2006-10-24 11:50 ` Jeffrey Creem 2006-10-24 16:24 ` Jeffrey R. Carter 2006-10-24 19:21 ` Wiljan Derks 1 sibling, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-24 11:50 UTC (permalink / raw) Dr. Adrian Wrigley wrote: > On Mon, 23 Oct 2006 21:52:59 +0200, Wiljan Derks wrote: > > >>>Jeffrey R. Carter wrote: >>>I looked at the gcc FORTRAN matmul. Unless there some additional trickery >>>going on behind the scenes, it is not anything magical. It looks like a >>>matmul implemented in C with manual array subscripting logic (i.e. uses a >>>single dimensional array overlay).. >> >>When going for a faster algorithm this might be usefull information. >>Notice that this matmul uses an O(n^3) algorithm and it is possible to do >>this faster. >>I am not sure, but maybe O(n^2 log(n)) is reachable, which is much faster. >>It is easy to see that from the loops that the basic multiplications are >>done n times for the same pairs of numbers. > > > I know you're quite correct that there is a faster algorithm for > large n. But looking at the loops, I don't see where the same > number pairs are multiplied at different points in the calculation. > Doesn't A(1,1)*B(1,1) only appear in C(1,1)? Where else? > > >>In a similar way, the same additions are done multiple times. >>That means that there is real potential to improve things. > > > Aren't they only the same additions if the same products are summed? > > I may be being particularly stupid today, but if the redundancy > is obvious from the loops, can you help me see where! Thanks. > -- > Adrian > I have not looked at algorithms in this area in a few years but I do recall that you need very very large N before any of them start being worthwhile. There are some speedups possibly that stay N**3 but optimize fetches and stores for cache performance and do hand vectorization of portions of the inner loop that can achieve speedup. 
I would guess that even the library level matmul (which still has the best overall time) could be made about 2x faster on a recent intel platform with those changes. Back to the original discussion, there are quite a few interesting things going on. 1) I have confirmed what another poster wrote that 4.2.0 is substantially slower than 4.0.3 on the original Ada example (thus, this performance is a regression) 2) Small changes to the original code that still meet the original structure and intent of the original code can move the run time up and down by at least 50% 3) The two task based version of the code is running more than 2x faster than the single task version on a 2 processor machine. Some of this is from the two tasks but looking at the assembly, another portion of it is related to #2 above in that the re-arrangement of the math allows the compiler to get less brain dead. If I get a chance tonight, I'll post some update results and analysis of the generated code. ^ permalink raw reply [flat|nested] 68+ messages in thread
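One cache-oriented rearrangement that stays O(n^3) is loop tiling with an i-k-j inner order; a sketch (assuming square, 1-based matrices in the default row-major layout, with the tile size a guess to be tuned per machine):

```ada
procedure Blocked_Matmul_Demo is
   type Real_Matrix is
     array (Positive range <>, Positive range <>) of Float;

   Blk : constant := 64;  --  tile edge; tune to the cache size

   --  i-k-j order keeps the innermost accesses to B and C unit-stride
   --  for a row-major layout; tiling I and K keeps the touched rows
   --  of B resident in cache across iterations.  Assumes A, B and C
   --  are square with bounds 1 .. N.
   procedure Matmul_Blocked (A, B : in Real_Matrix; C : out Real_Matrix) is
      N : constant Natural := A'Length (1);
   begin
      C := (others => (others => 0.0));
      for I0 in 0 .. (N - 1) / Blk loop
         for K0 in 0 .. (N - 1) / Blk loop
            for I in I0 * Blk + 1 .. Natural'Min ((I0 + 1) * Blk, N) loop
               for K in K0 * Blk + 1 .. Natural'Min ((K0 + 1) * Blk, N) loop
                  declare
                     AIK : constant Float := A (I, K);
                  begin
                     for J in 1 .. N loop
                        C (I, J) := C (I, J) + AIK * B (K, J);
                     end loop;
                  end;
               end loop;
            end loop;
         end loop;
      end loop;
   end Matmul_Blocked;

   A, B, C : Real_Matrix (1 .. 100, 1 .. 100);
begin
   A := (others => (others => 1.0));
   B := (others => (others => 1.0));
   Matmul_Blocked (A, B, C);
end Blocked_Matmul_Demo;
```

The procedure and demo names are made up; the point is only the loop structure, which is what the hand-optimized library kernels mentioned above do more aggressively (plus vectorization).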
* Re: GNAT compiler switches and optimization 2006-10-24 11:50 ` Jeffrey Creem @ 2006-10-24 16:24 ` Jeffrey R. Carter 2006-10-25 3:50 ` Jeffrey Creem 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-24 16:24 UTC (permalink / raw) Jeffrey Creem wrote: > > 2) Small changes to the original code that still meet the original > structure and intent of the original code can move the run time up and > down by at least 50% > 3) The two task based version of the code is running more than 2x faster > than the single task version on a 2 processor machine. Some of this is > from the two tasks but looking at the assembly, another portion of it is > related to #2 above in that the re-arrangement of the math allows the > compiler to get less brain dead. These seem quite odd to me. Perhaps whatever is causing this is also the cause of the speedup I saw in the sequential case when the multiplication is put in a procedure. -- Jeff Carter "You've got the brain of a four-year-old boy, and I bet he was glad to get rid of it." Horse Feathers 47 ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-24 16:24 ` Jeffrey R. Carter @ 2006-10-25 3:50 ` Jeffrey Creem 2006-10-25 15:32 ` claude.simon 0 siblings, 1 reply; 68+ messages in thread From: Jeffrey Creem @ 2006-10-25 3:50 UTC (permalink / raw) Jeffrey R. Carter wrote: > Jeffrey Creem wrote: > >> >> 2) Small changes to the original code that still meet the original >> structure and intent of the original code can move the run time up and >> down by at least 50% >> 3) The two task based version of the code is running more than 2x >> faster than the single task version on a 2 processor machine. Some of >> this is from the two tasks but looking at the assembly, another >> portion of it is related to #2 above in that the re-arrangement of >> the math allows the compiler to get less brain dead. > > > These seem quite odd to me. Perhaps whatever is causing this is also the > cause of the speedup I saw in the sequential case when the > multiplication is put in a procedure. > Ok, we are all probably tired of taking about this topic but I posted what I think will be the final write-up on this (until the bugzilla issue for it is resolved). http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison And I agree that the > 2x speedup for the two task version is not real. There are now other versions that are coded in a similar maner to the 2 task version but which are single threaded. The 2 task version runs almost exactly 2x faster (sometimes slightly slower) on my dual processor machine. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-25 3:50 ` Jeffrey Creem @ 2006-10-25 15:32 ` claude.simon 0 siblings, 0 replies; 68+ messages in thread From: claude.simon @ 2006-10-25 15:32 UTC (permalink / raw) A way to have better performance is to optimize the use of the memory cache. An Ada-written matmul procedure can approach the Fortran intrinsic matmul speed with -O2 and -gnatp. ada version with array size of 600 : time = 0.6722 fortran version with array size of 600 : time = 0.6094

package Array_Product is
   type Real_Matrix is array (Integer range <>, Integer range <>) of Float;
   type Matrix_Access is access Real_Matrix;
   procedure Matmul (A, B : in Real_Matrix; C : out Real_Matrix);
end Array_Product;

package body Array_Product is
   procedure Matmul (A, B : in Real_Matrix; C : out Real_Matrix) is
      BL  : array (1 .. B'Length (1) * B'Length (2)) of Float;
      D   : Natural := 0;
      Sum : Float   := 0.0;
   begin
      D := 0;
      --  transposition gave the best speedup: cache + only one index
      Transpose_B_in_BL :
      for J in B'Range (1) loop
         for R in B'Range (2) loop
            BL (D + R) := B (R, J);
         end loop;
         D := D + B'Length (2);
      end loop Transpose_B_in_BL;
      for I in A'Range (1) loop
         declare
            --  give the cache its best chance with A: second speedup
            Ai : array (A'Range (2)) of Float;
         begin
            for R in Ai'Range loop
               Ai (R) := A (I, R);
            end loop;
            D := 0;
            for J in A'Range (2) loop
               Sum := 0.0;
               for R in A'Range (2) loop
                  Sum := Sum + Ai (R) * BL (D + R);
               end loop;
               D := D + A'Length (2);
               C (I, J) := Sum;
            end loop;
         end;
      end loop;
   end Matmul;
end Array_Product;

Jeffrey Creem a écrit : > Jeffrey R. Carter wrote: > > Jeffrey Creem wrote: > > > >> > >> 2) Small changes to the original code that still meet the original > >> structure and intent of the original code can move the run time up and > >> down by at least 50% > >> 3) The two task based version of the code is running more than 2x > >> faster than the single task version on a 2 processor machine.
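A hypothetical driver for the Array_Product package above (the matrix size and fill values are made up; note that Matmul as posted assumes square matrices and puts its scratch arrays BL and Ai on the stack, so very large N may need a larger stack limit):

```ada
with Array_Product; use Array_Product;

procedure Tst_Matmul is
   N : constant := 600;
   --  Heap-allocate the operands, as in the original benchmark.
   A : constant Matrix_Access := new Real_Matrix (1 .. N, 1 .. N);
   B : constant Matrix_Access := new Real_Matrix (1 .. N, 1 .. N);
   C : constant Matrix_Access := new Real_Matrix (1 .. N, 1 .. N);
begin
   A.all := (others => (others => 1.0));
   B.all := (others => (others => 2.0));
   Matmul (A.all, B.all, C.all);
end Tst_Matmul;
```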
Some of > >> this is from the two tasks but looking at the assembly, another > >> portion of it is related to #2 above in that the re-arrangement of > >> the math allows the compiler to get less brain dead. > > > > > > These seem quite odd to me. Perhaps whatever is causing this is also the > > cause of the speedup I saw in the sequential case when the > > multiplication is put in a procedure. > > > > Ok, we are all probably tired of taking about this topic but I posted > what I think will be the final write-up on this (until the bugzilla > issue for it is resolved). > > http://gnuada.sourceforge.net/pmwiki.php/Main/Oct2006CLAGFORTRANComparison > > And I agree that the > 2x speedup for the two task version is not real. > There are now other versions that are coded in a similar maner to the 2 > task version but which are single threaded. The 2 task version runs > almost exactly 2x faster (sometimes slightly slower) on my dual > processor machine. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-24 9:52 ` Dr. Adrian Wrigley 2006-10-24 11:50 ` Jeffrey Creem @ 2006-10-24 19:21 ` Wiljan Derks 1 sibling, 0 replies; 68+ messages in thread From: Wiljan Derks @ 2006-10-24 19:21 UTC (permalink / raw) "Dr. Adrian Wrigley" <amtw@linuxchip.demon.co.uk.uk.uk> wrote in message news:pan.2006.10.24.09.54.07.340442@linuxchip.demon.co.uk.uk.uk... > I know you're quite correct that there is a faster algorithm for > large n. But looking at the loops, I don't see where the same > number pairs are multiplied at different points in the calculation. > Doesn't A(1,1)*B(1,1) only appear in C(1,1)? Where else? > > Aren't they only the same additions if the same products are summed? > Sorry, I was too fast writing this remark. Since I was wrong, I checked my old school papers and found how to perform a faster matrix multiplication. It is called the Strassen algorithm. For more details see: http://en.wikipedia.org/wiki/Strassen_algorithm It still looks to me like there may be considerable savings, since this algorithm is O(n^2.8) instead of O(n^3). Wikipedia also mentions the drawback: numeric instability! So the subtraction and addition tricks probably lead to less accurate results in certain cases. ^ permalink raw reply [flat|nested] 68+ messages in thread
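For the 2x2 base case, the seven Strassen products look like this (a sketch transcribed from the standard formulation; in a real implementation the scalars would themselves be sub-matrices, and the recursion would fall back to an ordinary O(n^3) kernel below some cutoff, since the bookkeeping only pays off for large n):

```ada
procedure Strassen_2x2
  (A11, A12, A21, A22 : in  Float;
   B11, B12, B21, B22 : in  Float;
   C11, C12, C21, C22 : out Float)
is
   --  Seven multiplications instead of the naive eight; the extra
   --  additions/subtractions are what cause the numeric-instability
   --  concern mentioned above.
   M1 : constant Float := (A11 + A22) * (B11 + B22);
   M2 : constant Float := (A21 + A22) * B11;
   M3 : constant Float := A11 * (B12 - B22);
   M4 : constant Float := A22 * (B21 - B11);
   M5 : constant Float := (A11 + A12) * B22;
   M6 : constant Float := (A21 - A11) * (B11 + B12);
   M7 : constant Float := (A12 - A22) * (B21 + B22);
begin
   C11 := M1 + M4 - M5 + M7;
   C12 := M3 + M5;
   C21 := M2 + M4;
   C22 := M1 - M2 + M3 + M6;
end Strassen_2x2;
```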
* Re: GNAT compiler switches and optimization 2006-10-22 19:32 ` Damien Carbonne 2006-10-22 20:00 ` Gautier 2006-10-23 1:31 ` Jeffrey Creem @ 2006-10-23 12:33 ` Warner BRUNS 2006-10-23 12:40 ` Warner BRUNS 3 siblings, 0 replies; 68+ messages in thread From: Warner BRUNS @ 2006-10-23 12:33 UTC (permalink / raw) Damien Carbonne wrote: > Jeffrey Creem a écrit : >> Followup on the bug report. >> >> One of the comments asserted that the two programs were not equivalent, >> though I am not yet 100% convinced that I believe it. >> > <snip> >> >> >> Can someone that understands FORTRAN better make an argument about the >> "closeness" of this approach v.s. the other? > > I've been following this thread for some time and I'm not a Fortran > guru; however, IIRC, Fortran arrays and Ada arrays don't have the same > memory layout (something like row-major order versus column-major order). > I've not compared the programs of this thread with care, but I have the > feeling that this could change things. > Some years ago, I had to mix Ada and Fortran, and I remember that we had > to invert loop order to obtain the same result. > One would need to add 'pragma Convention (Fortran, Real_Matrix)' on the Ada > side to obtain the same result, or to exchange loops. > My knowledge of Fortran was limited to Fortran 77, and I don't know if > this is still valid with Fortran 95 or later. > But if it is still true, I would not consider the 2 programs as equivalent. > > Regards > > Damien Carbonne > A problem with Ada is that the memory layout of a two-dimensional array is not specified in the Ada Standard. If you want to have a uniform memory layout between different Ada compilers, you have to specify that the memory layout should be like Fortran would do it. Very odd. TYPE DefReal IS DIGITS 12; TYPE DefReal2D IS ARRAY(DefInt RANGE <>, DefInt RANGE <>) OF DefReal; PRAGMA CONVENTION (Fortran, DefReal2D); Warner ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-22 19:32 ` Damien Carbonne ` (2 preceding siblings ...) 2006-10-23 12:33 ` Warner BRUNS @ 2006-10-23 12:40 ` Warner BRUNS 2006-10-23 13:52 ` Georg Bauhaus 2006-10-23 15:02 ` Robert A Duff 3 siblings, 2 replies; 68+ messages in thread From: Warner BRUNS @ 2006-10-23 12:40 UTC (permalink / raw) Damien Carbonne wrote: > Jeffrey Creem a écrit : >> Followup on the bug report. >> >> One of the comments asserted that the two programs were not equivalent, >> though I am not yet 100% convinced that I believe it. >> > <snip> >> >> >> Can someone that understands FORTRAN better make an argument about the >> "closeness" of this approach v.s. the other? > > I've been following this thread for some time and I'm not a Fortran > guru; however, IIRC, Fortran arrays and Ada arrays don't have the same > memory layout (something like row-major order versus column-major order). > I've not compared the programs of this thread with care, but I have the > feeling that this could change things. > Some years ago, I had to mix Ada and Fortran, and I remember that we had > to invert loop order to obtain the same result. > One would need to add 'pragma Convention (Fortran, Real_Matrix)' on the Ada > side to obtain the same result, or to exchange loops. > My knowledge of Fortran was limited to Fortran 77, and I don't know if > this is still valid with Fortran 95 or later. > But if it is still true, I would not consider the 2 programs as equivalent. > > Regards > > Damien Carbonne > A problem with Ada is that the memory layout of multidimensional arrays is not specified in the standard. If you want to have the same memory layout for all Ada compilers, you have to specify that a multidimensional array should follow the convention of a foreign language, eg. Fortran. TYPE DefReal IS DIGITS 12; TYPE DefReal2D IS ARRAY(Integer RANGE <>, Integer RANGE <>) OF DefReal; PRAGMA CONVENTION (Fortran, DefReal2D); There is something missing in the Ada standard.
Warner ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 12:40 ` Warner BRUNS @ 2006-10-23 13:52 ` Georg Bauhaus 2006-10-23 17:11 ` Warner BRUNS 2006-10-23 15:02 ` Robert A Duff 1 sibling, 1 reply; 68+ messages in thread From: Georg Bauhaus @ 2006-10-23 13:52 UTC (permalink / raw) On Mon, 2006-10-23 at 14:40 +0200, Warner BRUNS wrote: > A problem with Ada is that the memory layout of multidimensional arrays > is not specified in the standard. If you want to have the same memory > layout for all Ada compilers, you have to specify that a multimensional > array should follow the convention of a foreign language, eg. Fortran. Will a memory layout cast in source really help with e.g. all Ada X Fortran combinations of compilers and libraries? It might disallow "knowledgeable" compilers to choose a layout that suits the current setup, I'd think, and require touching the sources. -- Georg ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 13:52 ` Georg Bauhaus @ 2006-10-23 17:11 ` Warner BRUNS 2006-10-23 17:57 ` Dr. Adrian Wrigley 0 siblings, 1 reply; 68+ messages in thread From: Warner BRUNS @ 2006-10-23 17:11 UTC (permalink / raw) Georg Bauhaus wrote: > On Mon, 2006-10-23 at 14:40 +0200, Warner BRUNS wrote: > > >> A problem with Ada is that the memory layout of multidimensional arrays >> is not specified in the standard. If you want to have the same memory >> layout for all Ada compilers, you have to specify that a multimensional >> array should follow the convention of a foreign language, eg. Fortran. > > Will a memory layout cast in source really help with e.g. all > Ada X Fortran combinations of compilers and libraries? > It might disallow "knowledgeable" compilers to choose a layout > that suits the current setup, I'd think, and require touching > the sources. > > > -- Georg > > It will help with all Ada/ Fortran Compiler combinations, but the really strange thing is that the memory layout is not definitely known from the Ada source alone, if it is not specified as being the layout as required by a different language, eg. Fortran. If one wants to code a cache aware algorithm for eg. multiplying matrices, one cannot do this in Ada without enforcing the memory layout of the matrices via PRAGMA CONVENTION. Warner ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 17:11 ` Warner BRUNS @ 2006-10-23 17:57 ` Dr. Adrian Wrigley 0 siblings, 0 replies; 68+ messages in thread From: Dr. Adrian Wrigley @ 2006-10-23 17:57 UTC (permalink / raw) On Mon, 23 Oct 2006 19:11:51 +0200, Warner BRUNS wrote: > Georg Bauhaus wrote: >> On Mon, 2006-10-23 at 14:40 +0200, Warner BRUNS wrote: >> >>> A problem with Ada is that the memory layout of multidimensional arrays >>> is not specified in the standard. If you want to have the same memory >>> layout for all Ada compilers, you have to specify that a multidimensional >>> array should follow the convention of a foreign language, eg. Fortran. >> >> Will a memory layout cast in source really help with e.g. all >> Ada X Fortran combinations of compilers and libraries? >> It might disallow "knowledgeable" compilers to choose a layout >> that suits the current setup, I'd think, and require touching >> the sources. >> >> -- Georg >> > It will help with all Ada/ Fortran Compiler combinations, > but the really strange thing is that the memory layout is not > definitely known from the Ada source alone, if it is not specified > as being the layout as required by a different language, eg. Fortran. > If one wants to code a cache aware algorithm for eg. multiplying > matrices, one cannot do this in Ada without enforcing the memory > layout of the matrices via PRAGMA CONVENTION. I agree that it seems like an omission from the Ada standard. This specification should be available using a representation clause. Telling the compiler to allocate it like Fortran would is the wrong way, logically. Of course, if it really mattered, you could use an array of arrays (like in C), or an array of array accesses. With appropriate use of pragma representation you have a very high degree of flexibility. Compare this to almost any other language, and you find Ada leads the field by a mile. What if you wanted an alternative layout in Fortran? Or C++?
Sometimes I'd like to see some intermingling of row/column ordering, so a page in memory holds a square sub-array of a 2-D array. The approach I'll take (if I get around to it!) is to use a 2-D array of acceses, allocated (by a pool allocator) with the desired pattern. Moral of the story: One Size does not Fit All... -- Adrian ^ permalink raw reply [flat|nested] 68+ messages in thread
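The "array of array accesses" idea sketched above might look like this (all names hypothetical; a custom storage pool on Row_Access is what would actually control where each row lands, e.g. tile-interleaved):

```ada
procedure Tiled_Layout_Demo is
   N : constant := 256;

   type Row is array (1 .. N) of Float;
   type Row_Access is access Row;
   --  A possible 'for Row_Access'Storage_Pool use ...;' clause would
   --  direct row allocation to a pool with the desired placement.

   type Matrix is array (1 .. N) of Row_Access;

   M : Matrix;
begin
   --  Each row is allocated separately, so the allocator (or pool)
   --  decides the memory pattern rather than the array type itself.
   for I in M'Range loop
      M (I) := new Row'(others => 0.0);
   end loop;

   M (1) (1) := 42.0;  --  element access costs one extra indirection
end Tiled_Layout_Demo;
```

The trade-off, of course, is the extra indirection per row versus the freedom of placement.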
* Re: GNAT compiler switches and optimization 2006-10-23 12:40 ` Warner BRUNS 2006-10-23 13:52 ` Georg Bauhaus @ 2006-10-23 15:02 ` Robert A Duff 2006-10-23 20:22 ` Jeffrey R. Carter 1 sibling, 1 reply; 68+ messages in thread From: Robert A Duff @ 2006-10-23 15:02 UTC (permalink / raw) Warner BRUNS <Warner.Bruns@cern.ch> writes: > A problem with Ada is that the memory layout of multidimensional arrays > is not specified in the standard. If you want to have the same memory > layout for all Ada compilers, you have to specify that a multimensional > array should follow the convention of a foreign language, eg. Fortran. > TYPE DefReal IS DIGITS 12; > TYPE DefReal2D IS ARRAY(Integer RANGE <>, > Integer RANGE <>) OF DefReal; > PRAGMA CONVENTION (Fortran, DefReal2D); In practise, you can count on the fact that all Ada compilers will choose row-major order, unless you ask for column-major via the above pragma, even though the RM does not say so. > There is something missing in the Ada standard. Perhaps. - Bob ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-23 15:02 ` Robert A Duff @ 2006-10-23 20:22 ` Jeffrey R. Carter 0 siblings, 0 replies; 68+ messages in thread From: Jeffrey R. Carter @ 2006-10-23 20:22 UTC (permalink / raw) Robert A Duff wrote: > > In practise, you can count on the fact that all Ada compilers will > choose row-major order, unless you ask for column-major via the above > pragma, even though the RM does not say so. Because of the implementation advice in ARM 3.6.2.: "An implementation should normally represent multidimensional arrays in row-major order, consistent with the notation used for multidimensional array aggregates (see 4.3.3). However, if a pragma Convention(Fortran, ...) applies to a multidimensional array type, then column-major order should be used instead (see B.5, ``Interfacing with Fortran'')." -- Jeff Carter "Blessed are they who convert their neighbors' oxen, for they shall inhibit their girth." Monty Python's Life of Brian 83 ^ permalink raw reply [flat|nested] 68+ messages in thread
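Given that advice, matching loop order to layout is straightforward; a small sketch (type names and sizes are arbitrary):

```ada
procedure Layout_Demo is
   type Row_Major is array (1 .. 256, 1 .. 256) of Float;

   type Col_Major is array (1 .. 256, 1 .. 256) of Float;
   pragma Convention (Fortran, Col_Major);  --  column-major, per B.5

   R : Row_Major;
   F : Col_Major;
begin
   --  Row-major: vary the SECOND index innermost for unit stride.
   for I in R'Range (1) loop
      for J in R'Range (2) loop
         R (I, J) := 0.0;
      end loop;
   end loop;

   --  Column-major: vary the FIRST index innermost for unit stride.
   for J in F'Range (2) loop
      for I in F'Range (1) loop
         F (I, J) := 0.0;
      end loop;
   end loop;
end Layout_Demo;
```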
* Re: GNAT compiler switches and optimization 2006-10-21 16:35 ` Jeffrey Creem 2006-10-21 17:04 ` Pascal Obry @ 2006-10-21 18:28 ` tmoran 1 sibling, 0 replies; 68+ messages in thread From: tmoran @ 2006-10-21 18:28 UTC (permalink / raw) > I'd certainly be willing to run a few benchmarks but the important thing > here is that rather innocent looking code is running 2-4x slower than it > "should". But: > > for Ada: "tst_array 800" runs in 18 seconds > > for Fortran "tst_array 800" runs in 6 seconds > > > > (if I use the fortran "matmul" intrinsic the fortran time drops to > > 2.5 seconds) Clearly the right way to do large matrix multiplies is to call matmul. I'd be surprised if it made any significant difference whether it was called from Ada or from Fortran. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-21 7:37 ` Gautier 2006-10-21 16:35 ` Jeffrey Creem @ 2006-10-23 6:28 ` Martin Krischik 1 sibling, 0 replies; 68+ messages in thread From: Martin Krischik @ 2006-10-23 6:28 UTC (permalink / raw) Gautier schrieb: > Jeffrey Creem: > >> Note, I am the first one to jump to the defense of "Ada" in general >> but in this case, GNAT just plain sucks compared to GNU FORTRAN as it >> does a poor job on (at least) the inner loop (verified by looking at >> the output assembly) > > There is something strange... Martin Krischik was able to trim the > overall time for the Ada code down to 24% of the first version (GNAT/GCC > 4.1.1). > This should make the Ada program as fast as the FORTRAN one, shouldn't it ? > Maybe it's because the test is done on a 64 bit machine ? A 64 bit system doesn't need -march -mcpu and -msse2 as desperately. Try >gcc -dumpmachine i686-pc-cygwin For this system i686 is default. Martin ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: GNAT compiler switches and optimization 2006-10-20 10:47 GNAT compiler switches and optimization tkrauss ` (11 preceding siblings ...) 2006-10-20 22:11 ` Jeffrey R. Carter @ 2006-10-21 12:39 ` Dr. Adrian Wrigley 12 siblings, 0 replies; 68+ messages in thread From: Dr. Adrian Wrigley @ 2006-10-21 12:39 UTC (permalink / raw) On Fri, 20 Oct 2006 03:47:44 -0700, tkrauss wrote: > I'm a bit stuck trying to figure out how to coax more performance > out of some Ada code. I suspect there is something simple (like > compiler switches) but I'm missing it. As an example I'm using > a simple matrix multiply and comparing it to similar code in > Fortran. Unfortunately the Ada code takes 3-4 times as long. > > I'm using GNAT (GPL 2006) and GFortran (4.2.0) and the following > compile options: > > gnat make -O3 -gnatp tst_array > gfortran -O3 tst_array.f95 > > Running them on 800x800 matrices (on my 2GHz laptop) > > for Ada: "tst_array 800" runs in 18 seconds > for Fortran "tst_array 800" runs in 6 seconds > > (if I use the fortran "matmul" intrinsic the fortran time drops to > 2.5 seconds) > > Note, I tried reordering the loops, removing the random calls, etc. > none of which made huge changes. There is something killing > performance > and/or a switch or two that I'm missing, but I can't seem to find it. > Any > thoughts? When I started using Ada, I found exactly the same thing. Programs ran slower. Sometimes much slower. Perhaps this is the biggest single disadvantage of Ada (GNAT) in practice, particularly for heavy numerical codes (compared to Fortran or C). Looking closer, I found the assembly output was sometimes littered with apparently redundant or unintended function calls, extra checks and other baggage. The prevailing claims at the time were that Ada was roughly as fast as C, sometimes faster (because of deeper semantics). Whenever logically identical code was compared, however, the output assembly code was often identical, giving the same performance.
In real code, I found big differences when using enumerations instead of integers, multi-dimensional arrays etc. And the language-defined maths libraries, random number generators, I/O etc. all risked major slow-downs.

The only solution I found was to examine the output code, and modify the source until I got code that was acceptable. Sometimes this meant dropping useful language features, or using less clear constructs. This is a real problem with the language (Ada using GNAT). For a lot of applications, the optimisation effort isn't worth it. And even for performance-critical applications, most code is outside of any hot-spots.

I get the impression that Fortran is an excellent language for getting efficient code easily. C is also quite good. But C++ and Ada both seem to need "hand-holding" to keep them efficient. Perl is the worst I know(!)

If you really care about performance, you'll check the assembly code or compare against expectations. As you fix unexpected bottlenecks, you'll find out what type of code compiles well with your compiler, and write future code avoiding the problem areas. Of course, when the compiler changes, you may find the rules change and your old code appears quaint or idiosyncratic.

The other posters to this thread have given some useful optimisations to your original code. Let us know whether this bridges the performance gap for you!
--
Adrian Wrigley, Cambridge, UK.
Thread overview: 68+ messages (newest: 2006-10-25 15:32 UTC)

2006-10-20 10:47 GNAT compiler switches and optimization tkrauss
2006-10-20 11:04 ` Duncan Sands
2006-10-21 10:45 ` Stephen Leake
2006-10-20 11:42 ` Duncan Sands
2006-10-20 15:41 ` Martin Krischik
2006-10-20 12:09 ` Samuel Tardieu
2006-10-20 12:18 ` Samuel Tardieu
2006-10-20 12:12 ` Gautier
2006-10-20 12:35 ` Dmitry A. Kazakov
2006-10-20 15:53 ` Martin Krischik
2006-10-20 12:52 ` Gautier
2006-10-20 13:27 ` claude.simon
2006-10-20 15:38 ` Robert A Duff
2006-10-20 19:32 ` Gautier
2006-10-20 15:56 ` Jeffrey Creem
2006-10-20 16:30 ` Martin Krischik
2006-10-20 19:51 ` Gautier
2006-10-20 22:11 ` Jeffrey R. Carter
2006-10-20 23:52 ` Jeffrey Creem
2006-10-21 7:37 ` Gautier
2006-10-21 16:35 ` Jeffrey Creem
2006-10-21 17:04 ` Pascal Obry
2006-10-21 21:22 ` Jeffrey Creem
2006-10-22 3:03 ` Jeffrey Creem
2006-10-22 7:39 ` Jeffrey R. Carter
2006-10-22 11:48 ` tkrauss
2006-10-22 18:02 ` Georg Bauhaus
2006-10-22 18:24 ` Jeffrey Creem
2006-10-23 0:10 ` Georg Bauhaus
2006-10-22 20:20 ` Jeffrey R. Carter
2006-10-22 12:31 ` Gautier
2006-10-22 20:26 ` Jeffrey R. Carter
2006-10-22 21:22 ` Gautier
2006-10-22 18:01 ` tmoran
2006-10-22 20:54 ` Jeffrey R. Carter
2006-10-22 13:50 ` Alinabi
2006-10-22 15:41 ` Jeffrey Creem
2006-10-23 0:02 ` Alinabi
2006-10-23 5:28 ` Gautier
2006-10-23 16:32 ` Alinabi
2006-10-22 15:57 ` Jeffrey Creem
2006-10-22 19:32 ` Damien Carbonne
2006-10-22 20:00 ` Gautier
2006-10-22 20:51 ` Damien Carbonne
2006-10-23 2:15 ` Jeffrey Creem
2006-10-23 2:29 ` Jeffrey R. Carter
2006-10-23 1:31 ` Jeffrey Creem
2006-10-23 3:10 ` Jeffrey Creem
2006-10-23 7:31 ` Jeffrey R. Carter
2006-10-23 11:55 ` Jeffrey Creem
2006-10-23 19:52 ` Wiljan Derks
2006-10-23 20:25 ` Jeffrey R. Carter
2006-10-24 9:52 ` Dr. Adrian Wrigley
2006-10-24 11:50 ` Jeffrey Creem
2006-10-24 16:24 ` Jeffrey R. Carter
2006-10-25 3:50 ` Jeffrey Creem
2006-10-25 15:32 ` claude.simon
2006-10-24 19:21 ` Wiljan Derks
2006-10-23 12:33 ` Warner BRUNS
2006-10-23 12:40 ` Warner BRUNS
2006-10-23 13:52 ` Georg Bauhaus
2006-10-23 17:11 ` Warner BRUNS
2006-10-23 17:57 ` Dr. Adrian Wrigley
2006-10-23 15:02 ` Robert A Duff
2006-10-23 20:22 ` Jeffrey R. Carter
2006-10-21 18:28 ` tmoran
2006-10-23 6:28 ` Martin Krischik
2006-10-21 12:39 ` Dr. Adrian Wrigley