From: "Marcus H. Mendenhall"
Subject: Re: (topic change on) Teaching sorts
Date: 1996/08/22
Message-ID: <321C7A2F.49A6@nashville.net>
References: <01bb8f1b$ce59c820$32ee6fce@timhome2> <4vfk6b$i6h@krusty.irvine.com>
To: Christian Bau
Organization: Vanderbilt University (most of the time)
Newsgroups: comp.lang.c,comp.lang.c++,comp.unix.programmer,comp.lang.ada

Christian Bau wrote:
-> On a real computer (PowerMac, no virtual memory, no background processes,
-> nothing that would interfere with execution time), the _number of
-> instructions per second_ did reproducibly vary by a factor up to _seven_
-> when going from n to n+1 (for example, case n = 128 took seven times
-> longer than cases n = 127 and n = 129). So for this computer, and this

Isn't caching fun? I have observed many bizarre effects on the PowerMacs when doing work that thrashes memory (FFTs, matrix multiplies, etc.). In effect, one can usually assume that the total number of CPU cycles actually used for floating point arithmetic in these cases is 0. Counting real memory hits due to cache reloads gives a much more accurate measure of time.

In the case of testing your matrix multiply, you could use a trick I used to investigate timing for FFTs: I took all pointer increments out of the loop, so that the algorithm proceeded as usual but carried out all its operations on the same few bytes of memory. It yields nonsense for the result, but gives an idea of how many CPU cycles are spent on everything except fetching. The speed increase is sometimes quite shocking (more than a factor of 10).

In your case, with the problem at 128 elements, I suspect the slowdown came from the way the PowerPC chips (some of them at least) choose which cache line to fill with new data: the 1024 byte offset between successive data points probably meant that each fetch required a complete cache line reload.

Marcus Mendenhall
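
A minimal sketch of the "no pointer increment" trick described above, in C. This is not the original FFT or matrix code: it uses a plain dot product as a stand-in, and the names dot_strided and time_dot are hypothetical. With step = 0 the loop does the same arithmetic and the same number of iterations, but every access lands on the same location, so the result is nonsense and the time reflects roughly everything except fetching.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in kernel: multiply-accumulate over two arrays,
   advancing both pointers by `step` elements per iteration.
   step = 0 is the trick: same arithmetic, same loop count, no new memory.
   Each array must hold at least (n - 1) * step + 1 elements. */
double dot_strided(const double *a, const double *b, size_t n, size_t step)
{
    double sum = 0.0;
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < n; i++) {
        sum += *a * *b;
        a += step;   /* with step == 0 the pointers never move */
        b += step;
    }
    return sum;
}

/* Run the kernel `reps` times and return elapsed CPU seconds from clock().
   The accumulated value is stored in *result so the work cannot be
   discarded as unused. */
double time_dot(const double *a, const double *b, size_t n, size_t step,
                int reps, double *result)
{
    clock_t start = clock();
    double acc = 0.0;
    int r;

    for (r = 0; r < reps; r++)
        acc += dot_strided(a, b, n, step);
    *result = acc;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing time_dot with step = 0 against step = 1 and step = 128 on large arrays gives a rough split between arithmetic cost and fetch cost. A step of 128 corresponds to the 1024 byte offset mentioned above only if each element is an 8-byte double, which is an assumption about the original matrix.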