From: Jeffrey Creem
Date: Tue, 24 Oct 2006 07:50:36 -0400
Newsgroups: comp.lang.ada
Subject: Re: GNAT compiler switches and optimization

Dr. Adrian Wrigley wrote:
> On Mon, 23 Oct 2006 21:52:59 +0200, Wiljan Derks wrote:
>
>>> Jeffrey R. Carter wrote:
>>> I looked at the gcc FORTRAN matmul. Unless there's some additional
>>> trickery going on behind the scenes, it is not anything magical. It
>>> looks like a matmul implemented in C with manual array subscripting
>>> logic (i.e. it uses a single-dimensional array overlay).
>>
>> When going for a faster algorithm this might be useful information.
>> Notice that this matmul uses an O(n^3) algorithm and it is possible
>> to do this faster. I am not sure, but maybe O(n^2 log(n)) is
>> reachable, which is much faster. It is easy to see from the loops
>> that the basic multiplications are done n times for the same pairs
>> of numbers.
>
> I know you're quite correct that there is a faster algorithm for
> large n. But looking at the loops, I don't see where the same
> number pairs are multiplied at different points in the calculation.
> Doesn't A(1,1)*B(1,1) only appear in C(1,1)? Where else?
>
>> In a similar way, the same additions are done multiple times.
>> That means that there is real potential to improve things.
>
> Aren't they only the same additions if the same products are summed?
>
> I may be being particularly stupid today, but if the redundancy
> is obvious from the loops, can you help me see where! Thanks.
> --
> Adrian

I have not looked at algorithms in this area in a few years, but I do
recall that you need a very, very large N before any of the sub-cubic
algorithms start being worthwhile. There are some possible speedups
that stay O(N**3) but optimize fetches and stores for cache
performance, and hand vectorization of portions of the inner loop can
also achieve a speedup. I would guess that even the library-level
matmul (which still has the best overall time) could be made about 2x
faster on a recent Intel platform with those changes.

Back to the original discussion, there are quite a few interesting
things going on.
1) I have confirmed what another poster wrote: 4.2.0 is substantially
slower than 4.0.3 on the original Ada example (thus, this performance
is a regression).

2) Small changes that still preserve the structure and intent of the
original code can move the run time up or down by at least 50%.

3) The two-task version of the code runs more than 2x faster than the
single-task version on a 2-processor machine. Some of this comes from
the two tasks, but looking at the assembly, another portion of it is
related to #2 above: the rearrangement of the math lets the compiler
generate less brain-dead code.

If I get a chance tonight, I'll post some updated results and analysis
of the generated code.