From: Jeffrey Creem
Date: Tue, 24 Oct 2006 07:50:36 -0400
Newsgroups: comp.lang.ada
Subject: Re: GNAT compiler switches and optimization

Dr. Adrian Wrigley wrote:
> On Mon, 23 Oct 2006 21:52:59 +0200, Wiljan Derks wrote:
>
>>> Jeffrey R. Carter wrote:
>>> I looked at the gcc FORTRAN matmul. Unless there's some additional
>>> trickery going on behind the scenes, it is not anything magical. It
>>> looks like a matmul implemented in C with manual array subscripting
>>> logic (i.e. it uses a single-dimensional array overlay).
>>
>> When going for a faster algorithm this might be useful information.
>> Notice that this matmul uses an O(n^3) algorithm and it is possible
>> to do this faster. I am not sure, but maybe O(n^2 log(n)) is
>> reachable, which is much faster. It is easy to see from the loops
>> that the basic multiplications are done n times for the same pairs
>> of numbers.
>
> I know you're quite correct that there is a faster algorithm for
> large n. But looking at the loops, I don't see where the same
> number pairs are multiplied at different points in the calculation.
> Doesn't A(1,1)*B(1,1) only appear in C(1,1)? Where else?
>
>> In a similar way, the same additions are done multiple times.
>> That means that there is real potential to improve things.
>
> Aren't they only the same additions if the same products are summed?
>
> I may be being particularly stupid today, but if the redundancy
> is obvious from the loops, can you help me see where! Thanks.
> --
> Adrian

I have not looked at algorithms in this area in a few years, but I do
recall that you need a very, very large N before any of the sub-cubic
algorithms start being worthwhile. There are some possible speedups
that stay O(N**3) but optimize fetches and stores for cache
performance, and hand vectorization of portions of the inner loop can
also achieve a speedup. I would guess that even the library-level
matmul (which still has the best overall time) could be made about 2x
faster on a recent Intel platform with those changes.

Back to the original discussion, there are quite a few interesting
things going on.
1) I have confirmed what another poster wrote: 4.2.0 is substantially
slower than 4.0.3 on the original Ada example (thus, this performance
is a regression).

2) Small changes that still preserve the structure and intent of the
original code can move the run time up or down by at least 50%.

3) The two-task version of the code runs more than 2x faster than the
single-task version on a 2-processor machine. Some of this comes from
the two tasks, but looking at the assembly, another portion of it is
related to #2 above: the rearrangement of the math lets the compiler
generate less brain-dead code.

If I get a chance tonight, I'll post some updated results and analysis
of the generated code.