From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Thread: 103376,7767a311e01e1cd
X-Google-Attributes: gid103376,public
X-Google-Language: ENGLISH,ASCII-7-bit
Path: 
 g2news2.google.com!news3.google.com!newsfeed2.dallas1.level3.net!news.level3.com!newsfeed-00.mathworks.com!newscon02.news.prodigy.net!prodigy.net!wns14feed!worldnet.att.net!attbi_s22.POSTED!53ab2750!not-for-mail
From: "Jeffrey R. Carter" <spam.not.jrcarter@acm.not.spam.org>
Organization: jrcarter at acm dot org
User-Agent: Thunderbird 1.5.0.7 (Windows/20060909)
MIME-Version: 1.0
Newsgroups: comp.lang.ada
Subject: Re: GNAT compiler switches and optimization
References: <1161341264.471057.252750@h48g2000cwc.googlegroups.com>
 <9Qb_g.111857$aJ.65708@attbi_s21> <434o04-7g7.ln1@newserver.thecreems.com>
 <4539ce34$1_2@news.bluewin.ch> <nrup04-5hj.ln1@newserver.thecreems.com>
 <453A532F.2070709@obry.net> <9kfq04-sgm.ln1@newserver.thecreems.com>
 <sj3r04-rlv.ln1@newserver.thecreems.com>
In-Reply-To: <sj3r04-rlv.ln1@newserver.thecreems.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Message-ID: <OeF_g.1031657$084.91539@attbi_s22>
NNTP-Posting-Host: 12.201.97.213
X-Complaints-To: abuse@mchsi.com
X-Trace: attbi_s22 1161502766 12.201.97.213 (Sun, 22 Oct 2006 07:39:26 GMT)
NNTP-Posting-Date: Sun, 22 Oct 2006 07:39:26 GMT
Date: Sun, 22 Oct 2006 07:39:27 GMT
Xref: g2news2.google.com comp.lang.ada:7125
Date: 2006-10-22T07:39:27+00:00
List-Id: <comp.lang.ada>

Jeffrey Creem wrote:
> 
> Actually, as a result of this, I submitted a bug report to the GCC 
> bugzilla list. You can follow progress on it here:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29543
> 
> Interesting initial feedback is that
> 1) Not an Ada bug.
> 2) Is a FORTRAN bug
> 3) Is a backend limitation of the optimizer.
> 
> Of course, the FORTRAN one still runs correctly so I don't think most 
> users will care that it is because of a bug :)

Interesting. I've been experimenting with some variations simply out of 
curiosity and found some things that seem a bit strange. (All results 
for an argument of 800.)

Adding the Sum variable makes an important difference, as others have 
reported, in my case from 5.82 to 4.38 s. Hoisting the indexing 
calculation for the result (C) matrix location out of the inner loop is 
a basic optimization, and I would be surprised if it isn't done. The 
only explanation I can think of for the difference is a cache issue: 
that all 3 matrices can't be kept in cache at once. Perhaps compiler 
writers would be able to make sense of this.
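
For illustration, a minimal sketch of the two inner-loop forms being 
compared. The version without Sum is not shown in this post, so its 
exact form here is an assumption; the names Sum_Sketch, C_Direct and 
C_Summed, the fixed size and the constant matrix values are made up for 
the example. It is not a benchmark.

with Ada.Text_IO; use Ada.Text_IO;

procedure Sum_Sketch is
    N : constant := 3; -- small fixed size, for illustration only

    type Real_Matrix is array (1 .. N, 1 .. N) of Float;

    A : constant Real_Matrix := (others => (others => 1.0) );
    B : constant Real_Matrix := (others => (others => 2.0) );

    C_Direct : Real_Matrix := (others => (others => 0.0) );
    C_Summed : Real_Matrix := (others => (others => 0.0) );

    Sum : Float;
begin
    All_Rows : for Row in A'range (1) loop
       All_Columns : for Column in B'range (2) loop
          -- Assumed form without Sum: the result component is read
          -- and written on every pass of the inner loop.
          Direct : for R in A'range (2) loop
             C_Direct (Row, Column) :=
                C_Direct (Row, Column) + A (Row, R) * B (R, Column);
          end loop Direct;

          -- Form with Sum: accumulate in a local scalar and store
          -- into the result matrix once per component.
          Sum := 0.0;

          Summed : for R in A'range (2) loop
             Sum := Sum + A (Row, R) * B (R, Column);
          end loop Summed;

          C_Summed (Row, Column) := Sum;
       end loop All_Columns;
    end loop All_Rows;

    -- Both forms add the same terms in the same order.
    if C_Direct = C_Summed then
       Put_Line ("Both forms give the same result");
    else
       Put_Line ("Results differ");
    end if;
end Sum_Sketch;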

Previously, I found no difference between -O2 and -O3. With this change, 
-O2 is faster.

The issue of using 'range compared to using "1 .. N" makes no difference 
in my version of the program.
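
As a sketch of what is being compared, the two loop headers look like 
this. Range_Sketch, Vector, the size and the values are hypothetical 
names and numbers chosen only for the example; both loops visit the 
same indices.

with Ada.Text_IO; use Ada.Text_IO;

procedure Range_Sketch is
    N : constant Positive := 4; -- illustrative size

    type Vector is array (1 .. N) of Float;

    V     : constant Vector := (others => 1.5);
    Sum_A : Float := 0.0;
    Sum_B : Float := 0.0;
begin
    -- Bounds taken from the array object via the 'range attribute.
    for I in V'range loop
       Sum_A := Sum_A + V (I);
    end loop;

    -- Bounds written out explicitly.
    for I in 1 .. N loop
       Sum_B := Sum_B + V (I);
    end loop;

    -- Reports whether the two loops produced the same total.
    Put_Line (Boolean'Image (Sum_A = Sum_B) );
end Range_Sketch;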

Something I found really surprising is that putting the multiplication 
in a procedure makes the program faster, down to 4.03 s. I have no idea 
why this would be so.

Compiled with MinGW GNAT 3.4.2, -O2, -gnatnp -fomit-frame-pointer. Run 
under Windows XP SP2 on a 3.2 GHz Pentium 4 HT with 1 GB RAM.

Here's the code:

with Ada.Numerics.Float_Random;
with Ada.Command_Line;          use Ada.Command_Line;
with Ada.Text_IO;               use Ada.Text_IO;
with Ada.Calendar;              use Ada.Calendar;

procedure Tst_Array is
    package F_IO is new Ada.Text_IO.Float_IO (Float);
    package D_IO is new Ada.Text_IO.Fixed_IO (Duration);

    N : constant Positive := Integer'Value (Argument (1) );

    type Real_Matrix is array (1 .. N, 1 .. N) of Float;
    pragma Convention (FORTRAN, Real_Matrix);

    G : Ada.Numerics.Float_Random.Generator;

    A,B : Real_Matrix :=
       (others => (others => Ada.Numerics.Float_Random.Random (G) ) );
    C : Real_Matrix := (others => (others => 0.0) );
    Start, Finish : Ada.Calendar.Time;

    procedure Multiply is
       Sum : Float;
    begin -- Multiply
       All_Rows : for Row in A'range (1) loop
          All_Columns : for Column in B'range (2) loop
             Sum := 0.0;

             All_Common : for R in A'range (2) loop
                Sum := Sum + A (Row, R) * B (R, Column);
             end loop All_Common;

             C (Row, Column) := Sum;
          end loop All_Columns;
       end loop All_Rows;
    end Multiply;
begin
    Start := Ada.Calendar.Clock;
    Multiply;
    Finish := Ada.Calendar.Clock;

    F_IO.Put (C (1, 1) );
    F_IO.Put (C (1, 2) );
    New_Line;
    F_IO.Put (C (2, 1) );
    F_IO.Put (C (2, 2) );
    New_Line;

    Put ("Time: ");
    D_IO.Put (Finish - Start);
    New_Line;
end Tst_Array;

Next, since some meaningful speed-ups of quick sort on a Pentium 4 HT 
processor from using 2 tasks have been reported, I thought I'd see what 
effect that had here. With 2 tasks, I got a time of 3.70 s. That's not a 
significant speed-up, about 9.1%.

Same compilation options and platform.

Here's that code:

with Ada.Numerics.Float_Random;
with Ada.Command_Line;          use Ada.Command_Line;
with Ada.Text_IO;               use Ada.Text_IO;
with Ada.Calendar;              use Ada.Calendar;

procedure Tst_Array is
    package F_IO is new Ada.Text_IO.Float_IO (Float);
    package D_IO is new Ada.Text_IO.Fixed_IO (Duration);

    N : constant Positive := Integer'Value (Argument (1) );

    type Real_Matrix is array (1 .. N, 1 .. N) of Float;
    pragma Convention (FORTRAN, Real_Matrix);

    G : Ada.Numerics.Float_Random.Generator;

    A, B : Real_Matrix :=
       (others => (others => Ada.Numerics.Float_Random.Random (G) ) );
    C : Real_Matrix := (others => (others => 0.0) );
    Start, Finish : Ada.Calendar.Time;

    procedure Multiply is
       procedure Multiply
          (Start_Row : in Positive; Stop_Row : in Positive)
       is
          Sum : Float;
       begin -- Multiply
          All_Rows : for Row in Start_Row .. Stop_Row loop
             All_Columns : for Column in B'range (2) loop
                Sum := 0.0;

                All_Common : for R in A'range (2) loop
                   Sum := Sum + A (Row, R) * B (R, Column);
                end loop All_Common;

                C (Row, Column) := Sum;
             end loop All_Columns;
          end loop All_Rows;
       end Multiply;

       task type Multiplier (Start_Row : Positive; Stop_Row : Positive);

       task body Multiplier is
          -- null;
       begin -- Multiplier
          Multiply (Start_Row => Start_Row, Stop_Row => Stop_Row);
       end Multiplier;

       Stop  : constant Positive := N / 2;
       Start : constant Positive := Stop + 1;

       Mult : Multiplier (Start_Row => 1, Stop_Row => Stop);
    begin -- Multiply
       Multiply (Start_Row => Start, Stop_Row => N);
    end Multiply;
begin
    Start := Ada.Calendar.Clock;
    Multiply;
    Finish := Ada.Calendar.Clock;

    F_IO.Put (C (1, 1) );
    F_IO.Put (C (1, 2) );
    New_Line;
    F_IO.Put (C (2, 1) );
    F_IO.Put (C (2, 2) );
    New_Line;

    Put ("Time: ");
    D_IO.Put (Finish - Start);
    New_Line;
end Tst_Array;

If I inline the inner Multiply, or put equivalent code in the task and 
the outer Multiply, the time is much longer than for the sequential 
version, presumably due to cache effects.

Since it appears you have 2 physical processors ("Dual Xeon 2.8 Ghz"), I 
would be interested in seeing what effect this concurrent version has on 
that platform. I also wonder how easy such a version would be to create 
in FORTRAN.

-- 
Jeff Carter
"Ada has made you lazy and careless. You can write programs in C that
are just as safe by the simple application of super-human diligence."
E. Robert Tisdale
72