From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.3 required=5.0 tests=BAYES_00,INVALID_MSGID
	autolearn=no autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: fc89c,97188312486d4578
X-Google-Attributes: gidfc89c,public
X-Google-Thread: 109fba,baaf5f793d03d420
X-Google-Attributes: gid109fba,public
X-Google-Thread: 1014db,6154de2e240de72a
X-Google-Attributes: gid1014db,public
X-Google-Thread: 103376,97188312486d4578
X-Google-Attributes: gid103376,public
From: "Marcus H. Mendenhall" <mendenmh@nashville.net>
Subject: Re: (topic change on) Teaching sorts
Date: 1996/08/22
Message-ID: <321C7A2F.49A6@nashville.net>#1/1
X-Deja-AN: 176126073
references: <DwGoHq.6n7@cwi.nl> <01bb8f1b$ce59c820$32ee6fce@timhome2>
 <TANMOY.96Aug21083507@qcd.lanl.gov> <4vfk6b$i6h@krusty.irvine.com>
 <christian.bau-2208961046370001@christian-mac.isltd.insignia.com>
to: Christian Bau <christian.bau@isltd.insignia.com>
content-type: text/plain; charset=us-ascii
organization: Vanderbilt University (most of the time)
mime-version: 1.0
newsgroups: comp.lang.c,comp.lang.c++,comp.unix.programmer,comp.lang.ada
x-mailer: Mozilla 2.02 (Macintosh; I; PPC)
Date: 1996-08-22T00:00:00+00:00
List-Id: <comp.lang.ada>


Christian Bau wrote:
-> On a real computer (PowerMac, no virtual memory, no background processes,
-> nothing that would interfere with execution time), the _number of
-> instructions per second_ did reproducably vary by a factor up to _seven_
-> when going from n to n+1 (for example, case n = 128 took seven times
-> longer than cases n = 127 and n = 129). So for this computer, and this

Isn't caching fun?  I have observed many bizarre effects on the
PowerMacs when doing work that thrashes memory (FFTs, matrix
multiplies, etc.).

In effect, one can usually assume that the total number of CPU cycles
actually used for floating-point arithmetic in these cases is 0.
Counting the real memory accesses caused by cache reloads gives a much
more accurate measure of time.

To test your matrix multiply, you could use a trick I used to
investigate timing for FFTs: I took all the pointer increments out of
the loop, so that the algorithm proceeded as usual but carried out all
its operations on the same few bytes of memory.  The result is
nonsense, but the timing gives an idea of how many CPU cycles are spent
on everything except fetching.  The speed increase is sometimes quite
shocking (more than a factor of 10).
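
A rough sketch of that trick in C, using a plain multiply-accumulate
loop as a stand-in for the FFT or matrix code (the names mac_loop and
time_mac are made up for this illustration, not taken from any real
program).  Calling it with stride 1 walks through the arrays as usual;
calling it with stride 0 does the same arithmetic on the same element
every time, which is the "pointer increments removed" case.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Multiply-accumulate n_ops times.  With stride 1 the loop walks the
   arrays, wrapping around at len; with stride 0 it reuses element 0,
   so the arithmetic is unchanged but almost nothing new is fetched.
   Hypothetical helper for illustration; stride must not exceed len. */
double mac_loop(const double *a, const double *b,
                size_t len, size_t n_ops, size_t stride)
{
    double sum = 0.0;
    size_t idx = 0;

    assert(stride <= len);
    for (size_t i = 0; i < n_ops; i++) {
        sum += a[idx] * b[idx];
        idx += stride;
        if (idx >= len)
            idx -= len;            /* wrap back into the array */
    }
    return sum;
}

/* CPU seconds used by one call to mac_loop, measured with clock().
   The computed sum is stored through result so the work is not
   trivially discarded. */
double time_mac(const double *a, const double *b,
                size_t len, size_t n_ops, size_t stride, double *result)
{
    clock_t t0 = clock();
    *result = mac_loop(a, b, len, n_ops, stride);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Comparing time_mac(..., 1, &r) against time_mac(..., 0, &r) on arrays
much larger than the cache gives a crude estimate of how much of the
run time is fetching.  An optimizing compiler may simplify the stride-0
loop, so the numbers are only meaningful if you check what was actually
generated.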

In your case, with the problem at 128 elements, I suspect this was
because of the way the PowerPC chips (some of them at least) choose
which cache line to fill with new data: the 1024-byte offset between
successive data points probably meant that each fetch required a
complete cache line reload.
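
To illustrate the kind of collision I mean, here is a small sketch that
counts how many distinct cache sets a strided walk touches.  The cache
geometry is an assumption chosen for the example (32-byte lines, 128
sets), not the documented layout of any particular PowerPC, and the
function name distinct_sets is made up.  It also assumes 8-byte
elements, so that 128 elements give the 1024-byte offset.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SETS 1024   /* upper bound on n_sets for this sketch */

/* Count the distinct cache sets touched by n_accesses addresses
   spaced stride_bytes apart, starting at address 0, for a cache with
   n_sets sets of line_bytes-byte lines.  The set for an address is
   taken to be (address / line_bytes) % n_sets, which is the usual
   simple indexing scheme and is assumed here. */
size_t distinct_sets(size_t stride_bytes, size_t n_accesses,
                     size_t line_bytes, size_t n_sets)
{
    unsigned char seen[MAX_SETS] = {0};
    size_t count = 0;

    assert(line_bytes > 0 && n_sets > 0 && n_sets <= MAX_SETS);
    for (size_t i = 0; i < n_accesses; i++) {
        size_t set = (i * stride_bytes / line_bytes) % n_sets;
        if (!seen[set]) {
            seen[set] = 1;
            count++;
        }
    }
    return count;
}
```

With these assumed parameters, distinct_sets(1024, 128, 32, 128)
returns 4: every access lands in one of only four sets, so the lines
keep evicting each other.  A stride of 1016 bytes (n = 127) spreads the
same 128 accesses over nearly all of the sets.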

Marcus Mendenhall