GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real

comp.lang.ada

* GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise!
@ 2018-02-17 12:55 Bojan Bozovic
  2018-02-17 15:17 ` Bojan Bozovic
  2018-02-18  1:51 ` Bojan Bozovic
  0 siblings, 2 replies; 13+ messages in thread
From: Bojan Bozovic @ 2018-02-17 12:55 UTC (permalink / raw)


-- code

with Ada.Calendar;
with Ada.Text_IO;
with Ada.Numerics.Real_Arrays;
use Ada.Calendar;
use Ada.Text_IO;
use Ada.Numerics.Real_Arrays;

procedure Matrix_Mul is

   package F_IO is new Ada.Text_IO.Fixed_IO (Day_Duration);
   use F_IO;

   Start_Time, End_Time : Time;

   procedure Put (X : Real_Matrix) is
   begin
      for I in X'Range (1) loop
         for J in X'Range (2) loop
            Put (Float'Image (Float (X (I, J))));
         end loop;
         New_Line;
      end loop;
   end Put;

   Matrix_A, Matrix_B, Result : Real_Matrix (1 .. 4, 1 .. 4);
   Elapsed_Time               : Duration;
   Sum                        : Float;
begin
   Matrix_A :=
     ((1.0, 1.0, 1.0, 1.0),
      (2.0, 2.0, 2.0, 2.0),
      (3.0, 3.0, 3.0, 3.0),
      (4.0, 4.0, 4.0, 4.0));
   Matrix_B :=
     ((16.0, 15.0, 14.0, 13.0),
      (12.0, 11.0, 10.0, 9.0),
      (8.0, 7.0, 6.0, 5.0),
      (4.0, 3.0, 2.0, 1.0));

   Start_Time := Clock;
   for Iteration in 1 .. 10_000_000 loop
      Result := Matrix_A * Matrix_B;
   end loop;
   End_Time     := Clock;
   Elapsed_Time := End_Time - Start_Time;
   Put (Result);
   New_Line;
   Put ("Elapsed Time is ");
   Put (Elapsed_Time);
   New_Line;
   Start_Time := Clock;
   for Iteration in 1 .. 10_000_000 loop
      for I in Matrix_A'Range (1) loop
         for J in Matrix_A'Range (2) loop
            Sum := 0.0;
            for K in Matrix_A'Range (2) loop
               Sum := Sum + Matrix_A (I, K) * Matrix_B (K, J);
            end loop;
            Result (I, J) := Sum;
         end loop;
      end loop;
   end loop;
   End_Time     := Clock;
   Elapsed_Time := End_Time - Start_Time;
   Put (Result);
   New_Line;
   Put ("Elapsed time is ");
   Put (Elapsed_Time);
   New_Line;
end Matrix_Mul;
-- end code

Results: FSF GNAT 7.2.0 x64
C:\Users\Bojan\Documents>gnatmake -O3 -fopt-info -march=skylake matrix_mul.adb -largs -s
gcc -c -O3 -fopt-info -march=skylake matrix_mul.adb
matrix_mul.adb:56:41: note: loop turned into non-loop; it never loops.
matrix_mul.adb:56:41: note: loop with 4 iterations completely unrolled
matrix_mul.adb:54:38: note: loop turned into non-loop; it never loops.
matrix_mul.adb:54:38: note: loop with 4 iterations completely unrolled
matrix_mul.adb:53:35: note: loop turned into non-loop; it never loops.
matrix_mul.adb:53:35: note: loop with 4 iterations completely unrolled
matrix_mul.adb:40:15: note: basic block vectorized
matrix_mul.adb:42:26: note: basic block vectorized
matrix_mul.adb:18:31: note: basic block vectorized
matrix_mul.adb:49:9: note: basic block vectorized
C:/MSYS64/MINGW64/lib/gcc/x86_64-w64-mingw32/7.2.0/adainclude/a-tifiio.adb:706:10: note: basic block vectorized
matrix_mul.adb:64:17: note: basic block vectorized
matrix_mul.adb:18:31: note: basic block vectorized
matrix_mul.adb:68:9: note: basic block vectorized
C:/MSYS64/MINGW64/lib/gcc/x86_64-w64-mingw32/7.2.0/adainclude/a-tifiio.adb:706:10: note: basic block vectorized
gnatbind -x matrix_mul.ali
gnatlink matrix_mul.ali -O3 -fopt-info -march=skylake -s

C:\Users\Bojan\Documents>matrix_mul
 4.00000E+01 3.60000E+01 3.20000E+01 2.80000E+01
 8.00000E+01 7.20000E+01 6.40000E+01 5.60000E+01
 1.20000E+02 1.08000E+02 9.60000E+01 8.40000E+01
 1.60000E+02 1.44000E+02 1.28000E+02 1.12000E+02

Elapsed Time is      1.338206667
 4.00000E+01 3.60000E+01 3.20000E+01 2.80000E+01
 8.00000E+01 7.20000E+01 6.40000E+01 5.60000E+01
 1.20000E+02 1.08000E+02 9.60000E+01 8.40000E+01
 1.60000E+02 1.44000E+02 1.28000E+02 1.12000E+02

Elapsed time is      0.000000445

C:\Users\Bojan\Documents>set path=C:\GNAT\2017\BIN;%path%

Results: GNAT GPL 2017 from AdaCore.

C:\Users\Bojan\Documents>gnatmake -O3 -fopt-info -mavx2 matrix_mul.adb -largs -s
gcc -c -O3 -fopt-info -mavx2 matrix_mul.adb
matrix_mul.adb:56:41: note: loop turned into non-loop; it never loops.
matrix_mul.adb:56:41: note: loop with 4 iterations completely unrolled
matrix_mul.adb:54:38: note: loop turned into non-loop; it never loops.
matrix_mul.adb:54:38: note: loop with 4 iterations completely unrolled
matrix_mul.adb:53:35: note: loop turned into non-loop; it never loops.
matrix_mul.adb:53:35: note: loop with 4 iterations completely unrolled
matrix_mul.adb:40:15: note: basic block vectorized
matrix_mul.adb:68:9: note: basic block vectorized
gnatbind -x matrix_mul.ali
gnatlink matrix_mul.ali -O3 -fopt-info -mavx2 -s

C:\Users\Bojan\Documents>matrix_mul
 4.00000E+01 3.60000E+01 3.20000E+01 2.80000E+01
 8.00000E+01 7.20000E+01 6.40000E+01 5.60000E+01
 1.20000E+02 1.08000E+02 9.60000E+01 8.40000E+01
 1.60000E+02 1.44000E+02 1.28000E+02 1.12000E+02

Elapsed Time is      2.145337334
 4.00000E+01 3.60000E+01 3.20000E+01 2.80000E+01
 8.00000E+01 7.20000E+01 6.40000E+01 5.60000E+01
 1.20000E+02 1.08000E+02 9.60000E+01 8.40000E+01
 1.60000E+02 1.44000E+02 1.28000E+02 1.12000E+02

Elapsed time is      0.000000444

C:\Users\Bojan\Documents>gcc --version
gcc (GCC) 6.3.1 20170510 (for GNAT GPL 2017 20170515)
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
See your AdaCore support agreement for details of warranty and support.
If you do not have a current support agreement, then there is absolutely
no warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.

Should I submit this as a bug to AdaCore? My computer has an Intel Core i3-6100U (Skylake, AVX2). Please try to reproduce.



* Re: GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise!
  2018-02-17 12:55 GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise! Bojan Bozovic
@ 2018-02-17 15:17 ` Bojan Bozovic
  2018-02-17 15:49   ` Bojan Bozovic
  2018-02-18  1:51 ` Bojan Bozovic
  1 sibling, 1 reply; 13+ messages in thread
From: Bojan Bozovic @ 2018-02-17 15:17 UTC (permalink / raw)


On Saturday, February 17, 2018 at 1:55:49 PM UTC+1, Bojan Bozovic wrote:
> Should I submit this as a bug to AdaCore? My computer has an Intel Core i3-6100U (Skylake, AVX2). Please try to reproduce.

The compiler is smart enough not to iterate 10 million times over constant values, but there is still an optimization problem.



* Re: GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise!
  2018-02-17 15:17 ` Bojan Bozovic
@ 2018-02-17 15:49   ` Bojan Bozovic
  0 siblings, 0 replies; 13+ messages in thread
From: Bojan Bozovic @ 2018-02-17 15:49 UTC (permalink / raw)


On Saturday, February 17, 2018 at 4:17:41 PM UTC+1, Bojan Bozovic wrote:
> The compiler is smart enough not to iterate 10 million times over constant values, but there is still an optimization problem.

I don't know how to use pragma Volatile to force the iterations. Even so, with a single multiplication instead of 10 million iterations, the results (from AdaCore GNAT GPL 2017) are:

C:\Users\Bojan\Documents>matrix_mul
 4.00000E+01 3.60000E+01 3.20000E+01 2.80000E+01
 8.00000E+01 7.20000E+01 6.40000E+01 5.60000E+01
 1.20000E+02 1.08000E+02 9.60000E+01 8.40000E+01
 1.60000E+02 1.44000E+02 1.28000E+02 1.12000E+02

Elapsed Time is      0.000009333
 4.00000E+01 3.60000E+01 3.20000E+01 2.80000E+01
 8.00000E+01 7.20000E+01 6.40000E+01 5.60000E+01
 1.20000E+02 1.08000E+02 9.60000E+01 8.40000E+01
 1.60000E+02 1.44000E+02 1.28000E+02 1.12000E+02

Elapsed time is      0.000000889

I'm submitting this as a bug to AdaCore.



* Re: GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise!
  2018-02-17 12:55 GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise! Bojan Bozovic
  2018-02-17 15:17 ` Bojan Bozovic
@ 2018-02-18  1:51 ` Bojan Bozovic
  2018-02-18 10:35   ` Jeffrey R. Carter
  1 sibling, 1 reply; 13+ messages in thread
From: Bojan Bozovic @ 2018-02-18  1:51 UTC (permalink / raw)


On Saturday, February 17, 2018 at 1:55:49 PM UTC+1, Bojan Bozovic wrote:
> Should I submit this as a bug to AdaCore? My computer has an Intel Core i3-6100U (Skylake, AVX2). Please try to reproduce.

Apart from a reply that they don't offer support for GPL versions of the compiler, I got no response. I hope this will be useful to someone.

-- code

with Ada.Calendar;
with Ada.Text_IO;
with Ada.Numerics.Real_Arrays;
use Ada.Calendar;
use Ada.Text_IO;
use Ada.Numerics.Real_Arrays;

procedure Matrix_Mul is

   package F_IO is new Ada.Text_IO.Fixed_IO (Day_Duration);
   use F_IO;

   Start_Time, End_Time : Time;
   
   procedure Put (X : Real_Matrix) is
   begin
      for I in X'Range (1) loop
         for J in X'Range (2) loop
            Put (Float'Image (Float (X (I, J))));
         end loop;
         New_Line;
      end loop;
   end Put;


   Matrix_A, Matrix_B, Result : Real_Matrix (1 .. 4, 1 .. 4);
   Elapsed_Time               : Duration;
   Sum                        : Float;
   pragma Volatile (Matrix_A);
   pragma Volatile (Matrix_B);
   pragma Volatile (Result);
   Iterations : constant := 10_000_000;
begin
   Matrix_A :=
     ((1.0, 1.0, 1.0, 1.0),
      (2.0, 2.0, 2.0, 2.0),
      (3.0, 3.0, 3.0, 3.0),
      (4.0, 4.0, 4.0, 4.0));
   Matrix_B :=
     ((16.0, 15.0, 14.0, 13.0),
      (12.0, 11.0, 10.0, 9.0),
      (8.0, 7.0, 6.0, 5.0),
      (4.0, 3.0, 2.0, 1.0));

   Start_Time := Clock;
   for Iteration in 1 .. Iterations loop
      Result := Matrix_A * Matrix_B;
   end loop;
   End_Time     := Clock;
   Elapsed_Time := End_Time - Start_Time;
   Put (Result);
   New_Line;
   Put ("Elapsed Time is ");
   Put (Elapsed_Time);
   New_Line;
   Start_Time := Clock;
   for Iteration in 1 .. Iterations loop
      for I in Matrix_A'Range (1) loop
         for J in Matrix_A'Range (2) loop
            Sum := 0.0;
            for K in Matrix_A'Range (2) loop
               Sum := Sum + Matrix_A (I, K) * Matrix_B (K, J);
            end loop;
            Result (I, J) := Sum;
         end loop;
      end loop;
   end loop;
   End_Time     := Clock;
   Elapsed_Time := End_Time - Start_Time;
   Put (Result);
   New_Line;
   Put ("Elapsed time is ");
   Put (Elapsed_Time);
   New_Line;
end Matrix_Mul;

-- end code

C:\Users\Bojan\Documents>gnatmake -O3 -mavx2 matrix_mul -largs -s
gcc -c -O3 -mavx2 matrix_mul.adb
gnatbind -x matrix_mul.ali
gnatlink matrix_mul.ali -O3 -mavx2 -s

C:\Users\Bojan\Documents>matrix_mul
 4.00000E+01 3.60000E+01 3.20000E+01 2.80000E+01
 8.00000E+01 7.20000E+01 6.40000E+01 5.60000E+01
 1.20000E+02 1.08000E+02 9.60000E+01 8.40000E+01
 1.60000E+02 1.44000E+02 1.28000E+02 1.12000E+02

Elapsed Time is      2.134868889
 4.00000E+01 3.60000E+01 3.20000E+01 2.80000E+01
 8.00000E+01 7.20000E+01 6.40000E+01 5.60000E+01
 1.20000E+02 1.08000E+02 9.60000E+01 8.40000E+01
 1.60000E+02 1.44000E+02 1.28000E+02 1.12000E+02

Elapsed time is      0.337315111

This is from the GNAT GPL 2017 compiler with the switches shown above.


* Re: GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise!
  2018-02-18  1:51 ` Bojan Bozovic
@ 2018-02-18 10:35   ` Jeffrey R. Carter
  2018-02-18 12:05     ` Bojan Bozovic
  0 siblings, 1 reply; 13+ messages in thread
From: Jeffrey R. Carter @ 2018-02-18 10:35 UTC (permalink / raw)

On 02/18/2018 02:51 AM, Bojan Bozovic wrote:
> 
> Apart from a reply that they don't offer support for GPL versions of the compiler, I got no response. I hope this will be useful to someone.

Nor should you expect anything to happen, since you have not found an error. An 
error is when the compiler accepts illegal code, rejects legal code, or produces 
object code that gives incorrect results. At best you have found an opportunity 
for better optimization. Even that seems unlikely.

Remember that Ada.Numerics.Real_Arrays."*" is a general-purpose library 
function. As such, it has to do things that your specific implementation of 
matrix multiplication doesn't: It has to check that Left'Length (2) = 
Right'Length (1) and raise an exception if they're not equal. Your code doesn't, 
and any such check should be optimized away because its condition will be 
statically False. The general code has to handle the case where Left'First (2) 
/= Right'First (1). Even when the offset is zero, that's still an extra addition 
in the inner loop.
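
To make the two extra costs concrete, here is a minimal sketch of what a general-purpose product has to do. It is an illustration only, not GNAT's source; the names General_Product_Sketch and Product are hypothetical.

```ada
with Ada.Text_IO;
with Ada.Numerics.Real_Arrays; use Ada.Numerics.Real_Arrays;

procedure General_Product_Sketch is

   --  Hypothetical stand-in for a general-purpose "*" (not GNAT's code).
   function Product (Left, Right : Real_Matrix) return Real_Matrix is
      Result : Real_Matrix (Left'Range (1), Right'Range (2));
      Sum    : Float;
   begin
      --  Extra work 1: the inner dimensions must agree.
      if Left'Length (2) /= Right'Length (1) then
         raise Constraint_Error with "incompatible dimensions";
      end if;

      for I in Result'Range (1) loop
         for J in Result'Range (2) loop
            Sum := 0.0;
            for K in Left'Range (2) loop
               --  Extra work 2: Right's rows need not start at the same
               --  index as Left's columns, so K is shifted by an offset.
               Sum := Sum + Left (I, K)
                 * Right (K - Left'First (2) + Right'First (1), J);
            end loop;
            Result (I, J) := Sum;
         end loop;
      end loop;
      return Result;
   end Product;

   --  Deliberately different index ranges to exercise the offset.
   A : constant Real_Matrix (1 .. 2, 1 .. 2) := ((1.0, 2.0), (3.0, 4.0));
   B : constant Real_Matrix (0 .. 1, 5 .. 6) := ((5.0, 6.0), (7.0, 8.0));
   C : constant Real_Matrix := Product (A, B);
begin
   for I in C'Range (1) loop
      for J in C'Range (2) loop
         Ada.Text_IO.Put (Float'Image (C (I, J)));
      end loop;
      Ada.Text_IO.New_Line;
   end loop;
end General_Product_Sketch;
```

With fixed 1 .. 4 bounds on both operands, as in the posted program, neither the check nor the offset is needed in hand-written loops.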

Also, Ada.Numerics.Real_Arrays is provided precompiled, and is surely compiled 
with different optimization options than your code. (IIRC, -O3 is considered 
experimental, and AdaCore is not going to compile its library code with 
experimental optimization.)

You can find GNAT's implementation of matrix multiplication as 
System.Generic_Array_Operations.Matrix_Matrix_Product. This is extremely 
general, allowing for non-numeric matrix components (complex numbers, for 
example), and different component types for the left, right, and result 
matrices. This may introduce additional barriers to optimization.
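
As a rough illustration of that level of generality, the sketch below declares a generic product with separate component and matrix types for the left operand, right operand, and result. The formal names and the profile are assumptions made for this example; they are not copied from System.Generic_Array_Operations.

```ada
with Ada.Text_IO;

procedure Generic_Product_Sketch is

   --  Hypothetical generic, loosely modeled on the description above.
   generic
      type Left_Scalar is private;
      type Right_Scalar is private;
      type Result_Scalar is private;
      type Left_Matrix is
        array (Integer range <>, Integer range <>) of Left_Scalar;
      type Right_Matrix is
        array (Integer range <>, Integer range <>) of Right_Scalar;
      type Result_Matrix is
        array (Integer range <>, Integer range <>) of Result_Scalar;
      Zero : Result_Scalar;
      with function "*"
        (L : Left_Scalar; R : Right_Scalar) return Result_Scalar is <>;
      with function "+"
        (L, R : Result_Scalar) return Result_Scalar is <>;
   function Product
     (Left : Left_Matrix; Right : Right_Matrix) return Result_Matrix;

   function Product
     (Left : Left_Matrix; Right : Right_Matrix) return Result_Matrix
   is
      Result : Result_Matrix (Left'Range (1), Right'Range (2));
      Sum    : Result_Scalar;
   begin
      if Left'Length (2) /= Right'Length (1) then
         raise Constraint_Error with "incompatible dimensions";
      end if;
      for I in Result'Range (1) loop
         for J in Result'Range (2) loop
            Sum := Zero;
            for K in Left'Range (2) loop
               --  "*" and "+" are the generic formals, so the compiler
               --  cannot assume plain floating-point arithmetic here.
               Sum := Sum + Left (I, K)
                 * Right (K - Left'First (2) + Right'First (1), J);
            end loop;
            Result (I, J) := Sum;
         end loop;
      end loop;
      return Result;
   end Product;

   type Int_Matrix is array (Integer range <>, Integer range <>) of Integer;
   type Float_Matrix is array (Integer range <>, Integer range <>) of Float;

   --  Mixed-type multiplication picked up by the "is <>" default.
   function "*" (L : Integer; R : Float) return Float is (Float (L) * R);

   function Mixed_Product is new Product
     (Left_Scalar   => Integer,
      Right_Scalar  => Float,
      Result_Scalar => Float,
      Left_Matrix   => Int_Matrix,
      Right_Matrix  => Float_Matrix,
      Result_Matrix => Float_Matrix,
      Zero          => 0.0);

   A : constant Int_Matrix (1 .. 2, 1 .. 2)   := ((1, 2), (3, 4));
   B : constant Float_Matrix (1 .. 2, 1 .. 2) := ((0.5, 1.5), (2.5, 3.5));
   C : constant Float_Matrix := Mixed_Product (A, B);
begin
   for I in C'Range (1) loop
      for J in C'Range (2) loop
         Ada.Text_IO.Put (Float'Image (C (I, J)));
      end loop;
      Ada.Text_IO.New_Line;
   end loop;
end Generic_Product_Sketch;
```

Because the scalar operations arrive as generic formal functions, how well the inner loop optimizes may depend on how the instance is expanded and inlined.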

Since Ada.Numerics.Real_Arrays is an instantiation of 
Ada.Numerics.Generic_Real_Arrays for Float, and GNAT does macro expansion of 
generics, if you use an explicit instantiation of 
Ada.Numerics.Generic_Real_Arrays in your code, it should be compiled with your 
compiler options and thus remove that variable from your comparison. Doing that 
in your code, and building with

gnatmake -O3 -mavx2 matrix_mul -largs -s

cuts the reported time for "*" by a factor of about 4. Not surprisingly, still 
slower. YMMV
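
A minimal sketch of such an explicit instantiation follows. The names Local_Instance_Sketch and Float_Arrays are hypothetical; in the posted program the instantiation would replace the with and use clauses for Ada.Numerics.Real_Arrays.

```ada
with Ada.Text_IO;
with Ada.Numerics.Generic_Real_Arrays;

procedure Local_Instance_Sketch is

   --  Instantiating the generic here, rather than using the predefined
   --  Ada.Numerics.Real_Arrays, is the change described above.
   package Float_Arrays is new Ada.Numerics.Generic_Real_Arrays (Float);
   use Float_Arrays;

   A : constant Real_Matrix (1 .. 2, 1 .. 2) := ((1.0, 2.0), (3.0, 4.0));
   B : constant Real_Matrix (1 .. 2, 1 .. 2) := ((5.0, 6.0), (7.0, 8.0));

   --  "*" now comes from the local instance Float_Arrays.
   C : constant Real_Matrix := A * B;
begin
   for I in C'Range (1) loop
      for J in C'Range (2) loop
         Ada.Text_IO.Put (Float'Image (C (I, J)));
      end loop;
      Ada.Text_IO.New_Line;
   end loop;
end Local_Instance_Sketch;
```

The rest of the program can stay unchanged, since the instance exports the same Real_Matrix type name and operators.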

-- 
Jeff Carter
Just as Khan was hindered by two-dimensional thinking in a
three-dimensional situation, so many developers are hindered
by sequential thinking in concurrent situations.
118


* Re: GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise!
  2018-02-18 10:35   ` Jeffrey R. Carter
@ 2018-02-18 12:05     ` Bojan Bozovic
  2018-02-18 13:31       ` Jeffrey R. Carter
  0 siblings, 1 reply; 13+ messages in thread
From: Bojan Bozovic @ 2018-02-18 12:05 UTC (permalink / raw)


Thanks very much for the clarification! It's always good to learn something new, I suppose.


* Re: GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise!
  2018-02-18 12:05     ` Bojan Bozovic
@ 2018-02-18 13:31       ` Jeffrey R. Carter
  2018-02-18 19:38         ` Bojan Bozovic
  0 siblings, 1 reply; 13+ messages in thread
From: Jeffrey R. Carter @ 2018-02-18 13:31 UTC (permalink / raw)


On 02/18/2018 01:05 PM, Bojan Bozovic wrote:
> Thanks very much for clarification! It's always good to learn something new, I suppose.

The real optimization in your example seems to be that the compiler optimizes 
away the 10E6 loop around the in-line code, but not around the call to "*". 
Removing both loops gives similar times for both multiplications. The call to 
"*" will never be as fast because it copies its result into your variable. 
Replacing Ada.Numerics.Real_Arrays with an instantiation of 
Ada.Numerics.Generic_Real_Arrays gives an additional factor of 2 reduction.
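
One way to keep a timing loop from being optimized away, sketched under 
assumptions: the names Bench_Sketch, Arrays and Checksum are invented for this 
example, and the iteration count is arbitrary. The idea is to change an input 
on every pass and to fold each result into a value that is printed afterwards, 
so the iterations are neither identical nor unused. It is not a guarantee 
against every optimization, only a common precaution.

```ada
with Ada.Calendar; use Ada.Calendar;
with Ada.Text_IO;  use Ada.Text_IO;
with Ada.Numerics.Generic_Real_Arrays;

procedure Bench_Sketch is
   package Arrays is new Ada.Numerics.Generic_Real_Arrays (Float);
   use Arrays;

   A : Real_Matrix (1 .. 4, 1 .. 4) := (others => (others => 1.0));
   B : constant Real_Matrix (1 .. 4, 1 .. 4) := (others => (others => 0.5));
   R : Real_Matrix (1 .. 4, 1 .. 4);

   Checksum : Float := 0.0;
   Start    : Time;
begin
   Start := Clock;
   for Iteration in 1 .. 1_000_000 loop
      --  Change one input element each pass so the iterations differ.
      A (1, 1) := Float (Iteration mod 7);
      R := A * B;
      --  Fold the result into a value that is used after the loop.
      Checksum := Checksum + R (1, 1);
   end loop;
   Put_Line ("Elapsed:" & Duration'Image (Clock - Start));
   Put_Line ("Checksum:" & Float'Image (Checksum));
end Bench_Sketch;
```

The same treatment would have to be applied to the hand-written loop for the 
two timings to be comparable.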

-- 
Jeff Carter
Just as Khan was hindered by two-dimensional thinking in a
three-dimensional situation, so many developers are hindered
by sequential thinking in concurrent situations.
118

* Re: GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise!
  2018-02-18 13:31       ` Jeffrey R. Carter
@ 2018-02-18 19:38         ` Bojan Bozovic
  2018-02-18 21:48           ` Nasser M. Abbasi
  0 siblings, 1 reply; 13+ messages in thread
From: Bojan Bozovic @ 2018-02-18 19:38 UTC (permalink / raw)

On Sunday, February 18, 2018 at 2:31:18 PM UTC+1, Jeffrey R. Carter wrote:
> On 02/18/2018 01:05 PM, Bojan Bozovic wrote:
> > Thanks very much for clarification! It's always good to learn something new, I suppose.
> 
> The real optimization in your example seems to be that the compiler optimizes 
> away the 10E6 loop around the in-line code, but not around the call to "*". 
> Removing both loops gives similar times for both multiplications. The call to 
> "*" will never be as fast because it copies its result into your variable. 
> Replacing Ada.Numerics.Real_Arrays with an instantiation of 
> Ada.Numerics.Generic_Real_Arrays gives an additional factor of 2 reduction.
> 
> -- 
> Jeff Carter
> Just as Khan was hindered by two-dimensional thinking in a
> three-dimensional situation, so many developers are hindered
> by sequential thinking in concurrent situations.
> 118

And indeed, now everything is in its place, as I no longer see matrix multiplication running several times slower than doing things 'by hand'. I used aggressive optimization because these days it is customary for a program to manipulate gigabytes or even terabytes of data while being at most megabytes in size. So compiling for various processors of the same family, and using a launcher program that queries the processor's capabilities and runs the best-optimized build, is nothing new. One might say most computation is now done on the GPU, but then the exact capabilities of the GPU must also be known. Thanks for making my Ada learning experience an enjoyment!

* Re: GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise!
  2018-02-18 19:38         ` Bojan Bozovic
@ 2018-02-18 21:48           ` Nasser M. Abbasi
  2018-02-18 22:50             ` Bojan Bozovic
  2018-02-19 21:08             ` Robert Eachus
  0 siblings, 2 replies; 13+ messages in thread
From: Nasser M. Abbasi @ 2018-02-18 21:48 UTC (permalink / raw)


On 2/18/2018 1:38 PM, Bojan Bozovic wrote:

If you are doing A*B by hand, then you are doing something
wrong. Almost all languages end up calling Blas
Fortran libraries for these operations. Your code and
the Ada code can't be faster.

http://www.netlib.org/blas/

Intel Math Kernel Library has all these.

https://en.wikipedia.org/wiki/Math_Kernel_Library

--Nasser


* Re: GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise!
  2018-02-18 21:48           ` Nasser M. Abbasi
@ 2018-02-18 22:50             ` Bojan Bozovic
  2018-02-19 21:08             ` Robert Eachus
  1 sibling, 0 replies; 13+ messages in thread
From: Bojan Bozovic @ 2018-02-18 22:50 UTC (permalink / raw)

On Sunday, February 18, 2018 at 10:48:42 PM UTC+1, Nasser M. Abbasi wrote:
> On 2/18/2018 1:38 PM, Bojan Bozovic wrote:
> 
> If you are doing A*B by hand, then you are doing something
> wrong. Almost all languages end up calling Blas
> Fortran libraries for these operations. Your code and
> the Ada code can't be faster.
> 
> http://www.netlib.org/blas/
> 
> Intel Math Kernel Library has all these.
> 
> https://en.wikipedia.org/wiki/Math_Kernel_Library
> 
> --Nasser

Well, I wanted to compare how Ada would do against C simply in 4x4 matrix multiplication, and I was surprised to see results several times slower, so I then tried coding it 'by hand' to attempt to achieve the same speed with Ada (and the speed is comparable). I'm just learning the language, so I was unaware of its finer points (such as making a new instance of Ada.Numerics.Generic_Real_Arrays). Thanks for the links. I will have to bookmark them.

* Re: GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise!
  2018-02-18 21:48           ` Nasser M. Abbasi
  2018-02-18 22:50             ` Bojan Bozovic
@ 2018-02-19 21:08             ` Robert Eachus
  2018-02-20  2:31               ` Bojan Bozovic
  1 sibling, 1 reply; 13+ messages in thread
From: Robert Eachus @ 2018-02-19 21:08 UTC (permalink / raw)

On Sunday, February 18, 2018 at 4:48:42 PM UTC-5, Nasser M. Abbasi wrote:
> On 2/18/2018 1:38 PM, Bojan Bozovic wrote:
> 
> If you are doing A*B by hand, then you are doing something
> wrong. Almost all languages end up calling Blas
> Fortran libraries for these operations. Your code and
> the Ada code can't be faster.
> 
> http://www.netlib.org/blas/
> 
> Intel Math Kernel Library has all these.
> 
> https://en.wikipedia.org/wiki/Math_Kernel_Library

For multiplying two small matrices, BLAS is overkill and will be slower.  If you have, say, 1000x1000 matrices, then you should be using BLAS.  But which BLAS?  Intel and AMD both have math libraries optimized for their CPUs.  However, I tend to use ATLAS.  ATLAS will build a BLAS targeted at your specific hardware.  This is not just about instruction set additions like SIMD2.  It will tailor the implementation to your number of cores and supported threads, cache sizes, and memory speeds.  I've also used GotoBLAS, but ATLAS, even though not perfect, builds all of BLAS3 using matrix multiplication and BLAS2, such that all operations slower than O(n^2) have their speed determined by matrix multiplication.  (It then tries multiple matrix multiplication codes with different parameters to find the fastest.)

Usually hardware vendor libraries catch up to and surpass ATLAS, but by then the hardware is obsolete. :-(   The other problem right now is that blas libraries are pretty dumb when it comes to multiprocessor systems.  I'm working on fixing that. ;-)

* Re: GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise!
  2018-02-19 21:08             ` Robert Eachus
@ 2018-02-20  2:31               ` Bojan Bozovic
  2018-02-26  6:58                 ` Robert Eachus
  0 siblings, 1 reply; 13+ messages in thread
From: Bojan Bozovic @ 2018-02-20  2:31 UTC (permalink / raw)


On Monday, February 19, 2018 at 10:08:41 PM UTC+1, Robert Eachus wrote:
> On Sunday, February 18, 2018 at 4:48:42 PM UTC-5, Nasser M. Abbasi wrote:
> > On 2/18/2018 1:38 PM, Bojan Bozovic wrote:
> > 
> > If you are doing A*B by hand, then you are doing something
> > wrong. Almost all languages end up calling Blas
> > Fortran libraries for these operations. Your code and
> > the Ada code can't be faster.
> > 
> > http://www.netlib.org/blas/
> > 
> > Intel Math Kernel Library has all these.
> > 
> > https://en.wikipedia.org/wiki/Math_Kernel_Library
> 
> For multiplying two small matrices, blas is overkill and will be slower.  If you have say, 1000x1000 matrices, then you should be using blas.  But which BLAS?  Intel and AMD both have math libraries optimized for their CPUs.  However, I tend to use ATLAS.  ATLAS will build a blas targeted at your specific hardware.  This is not just about instruction set additions like SIMD2.  It will tailor the implementation to your number of cores and supported threads, cache sizes, and memory speeds.  I've also used the goto blas, but ATLAS even though not perfect, builds all of blas3 using matrix multiplication and blas2, such that all operations slower than O(n^2) have their speed determined by matrix multiplication.  (Then use multiple matrix multiplication codes with different parameters to find the fastest.)
> 
> Usually hardware vendor libraries catch up to and surpass ATLAS, but by then the hardware is obsolete. :-(   The other problem right now is that blas libraries are pretty dumb when it comes to multiprocessor systems.  I'm working on fixing that. ;-)

I have looked at ATLAS; however, it can't spawn more threads than were specified at compile time, so there is a lot of room to optimize there by spawning as many threads as are supported at run time. Ada would do much better here than C, because you could write portable code without resorting to the ugly hacks of C, and use parallelism no matter what the underlying processor architecture is. That's my $0.02, worthless or not (and if you want to use assembler to "optimize" further in C, that can be done in any language, which I fear is what Intel MKL and other vendor libraries do).
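
To illustrate the point about portable parallelism, here is a rough sketch of 
a row-partitioned multiply using Ada tasks, with the worker count taken from 
System.Multiprocessors.Number_Of_CPUs (Ada 2012) at run time. The names 
Parallel_Rows, Row_Worker and Pool, the matrix size N, and the slicing scheme 
are all assumptions made for this example; it shows the mechanism only and is 
not tuned for speed.

```ada
with Ada.Text_IO;
with System.Multiprocessors;

procedure Parallel_Rows is
   use System.Multiprocessors;

   N : constant := 8;
   type Matrix is array (1 .. N, 1 .. N) of Float;

   A : constant Matrix := (others => (others => 1.0));
   B : constant Matrix := (others => (others => 2.0));
   C : Matrix := (others => (others => 0.0));

   --  Worker count chosen at run time, capped at one worker per row.
   Workers : constant Positive :=
     Positive'Min (Positive (Number_Of_CPUs), N);

   task type Row_Worker is
      entry Start (First, Last : in Natural);
   end Row_Worker;

   task body Row_Worker is
      Lo, Hi : Natural;
      Sum    : Float;
   begin
      accept Start (First, Last : in Natural) do
         Lo := First;
         Hi := Last;
      end Start;
      --  Each worker writes only its own rows of C, so no locking is used.
      for I in Lo .. Hi loop
         for J in 1 .. N loop
            Sum := 0.0;
            for K in 1 .. N loop
               Sum := Sum + A (I, K) * B (K, J);
            end loop;
            C (I, J) := Sum;
         end loop;
      end loop;
   end Row_Worker;
begin
   declare
      Pool : array (1 .. Workers) of Row_Worker;
   begin
      for W in Pool'Range loop
         --  Split rows 1 .. N into Workers contiguous slices.
         Pool (W).Start ((W - 1) * N / Workers + 1, W * N / Workers);
      end loop;
   end;  --  This block is left only after every worker has terminated.

   Ada.Text_IO.Put_Line (Float'Image (C (N, N)));
end Parallel_Rows;
```

Every element of C ends up as 16.0 (eight products of 1.0 * 2.0). Workers is an 
ordinary constant, so it could just as well be set to some other value than 
the reported CPU count.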


* Re: GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise!
  2018-02-20  2:31               ` Bojan Bozovic
@ 2018-02-26  6:58                 ` Robert Eachus
  0 siblings, 0 replies; 13+ messages in thread
From: Robert Eachus @ 2018-02-26  6:58 UTC (permalink / raw)


On Monday, February 19, 2018 at 9:31:26 PM UTC-5, Bojan Bozovic wrote:

> I have looked at ATLAS, however it can't spawn more threads than specified at compile time, so there's lots of possibility to optimize there, by spawning as many threads as supported at run-time.

Aarrg!  Yes, there is a lot of work that needs to be done.  The intent is that you run ATLAS on your target environment, then use the best result as your blas library.

But the problem I am fighting with right now is that on the most recent (high-end) processors from Intel and AMD, you never want to use as many threads as the hardware tells you are available at run time.  In fact, it is common that if you have a processor which supports 8 threads, you want to run four threads, one on each even-numbered (or each odd-numbered) hardware thread.  The recent Threadripper and EPYC CPUs from AMD make it even more complex, as do any multisocket systems.  Usually you want to split the problem up completely and duplicate the data on each hardware CPU chip.


Thread overview: 13+ messages
2018-02-17 12:55 GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise! Bojan Bozovic
2018-02-17 15:17 ` Bojan Bozovic
2018-02-17 15:49   ` Bojan Bozovic
2018-02-18  1:51 ` Bojan Bozovic
2018-02-18 10:35   ` Jeffrey R. Carter
2018-02-18 12:05     ` Bojan Bozovic
2018-02-18 13:31       ` Jeffrey R. Carter
2018-02-18 19:38         ` Bojan Bozovic
2018-02-18 21:48           ` Nasser M. Abbasi
2018-02-18 22:50             ` Bojan Bozovic
2018-02-19 21:08             ` Robert Eachus
2018-02-20  2:31               ` Bojan Bozovic
2018-02-26  6:58                 ` Robert Eachus
