From: Dan.Pop@cern.ch (Dan Pop)
Subject: Re: What's the best language to start with? [was: Re: Should I learn C or Pascal?]
Date: 1996/08/14
Organization: CERN European Lab for Particle Physics
Newsgroups: comp.lang.c,comp.lang.c++,comp.unix.programmer,comp.lang.ada
References: <31FBC584.4188@ivic.qc.ca> <01bb8342$88cc6f40$32ee6fcf@timhome2>
 <4u7grn$eb0@news1.mnsinc.com> <01bb83ad$29c3cfa0$87ee6fce@timpent.airshields.com>
 <4u89c4$p7p@solutions.solon.com> <01bb83f5$923391e0$87ee6fce@timpent.airshields.com>
 <01bb8534$b2718bc0$87ee6fce@timpent.airshields.com>
 <01bb87cf$97ae8e80$87ee6fce@timpent.airshields.com>
 <01bb88d4$022bf4a0$32ee6fce@timhome2>

In <01bb88d4$022bf4a0$32ee6fce@timhome2> "Tim Behrendsen" writes:

>Dan Pop wrote in article ...
>> In <01bb87cf$97ae8e80$87ee6fce@timpent.airshields.com> "Tim Behrendsen" writes:
>>
>> >Dan Pop wrote in article ...
>> >> In <01bb8534$b2718bc0$87ee6fce@timpent.airshields.com> "Tim Behrendsen" writes:
>> >>
>> >> >Here's an example:
>> >> >
>> >> >int a[50000],b[50000],c[50000],d[50000],e[50000];
>> >> >
>> >> >void test1()
>> >> >{
>> >> >    int i, j;
>> >> >    for (j = 0; j < 10; ++j) {
>> >> >        for (i = 0; i < 50000; ++i) {
>> >> >            ++a[i]; ++b[i]; ++c[i]; ++d[i]; ++e[i];
>> >> >        }
>> >> >    }
>> >> >}
>> >> >
>> >> >void test2()
>> >> >{
>> >> >    int i, j;
>> >> >    for (j = 0; j < 10; ++j) {
>> >> >        for (i = 0; i < 50000; ++i) ++a[i];
>> >> >        for (i = 0; i < 50000; ++i) ++b[i];
>> >> >        for (i = 0; i < 50000; ++i) ++c[i];
>> >> >        for (i = 0; i < 50000; ++i) ++d[i];
>> >> >        for (i = 0; i < 50000; ++i) ++e[i];
>> >> >    }
>> >> >}
>> >> >
>> >> >On my AIX system, test1 runs in 2.47 seconds, and test2
>> >> >runs in 1.95 seconds using maximum optimization (-O3).  The
>> >> >reason I knew the second would be faster is because I know
>> >> >to limit the amount of context information the optimizer has
>> >> >to deal with in the inner loops, and I know to keep memory
>> >> >localized.
>> >>
>> >> 1. For a marginal speed increase (~25%), you compromised the
>> >>    readability of the code.
>> >
>> >You call 25% a "marginal" increase?  It's attitudes like this
>> >that give us the slow code world we have right now.
>>
>> Yes, I do call 25% a marginal increase.  Would you be considerably
>> happier if everything ran 25% faster, at the expense of everything
>> being harder to develop and maintain by a factor of 2?

>No, I wouldn't be happier with "a factor of 2", but that's a
>very large presumption.

No, it isn't.  In nontrivial and meaningful examples (i.e. not like
yours) there is quite a difference between the effort to express an
algorithm in the most natural way and in the most efficient way for a
certain compiler/platform combination.  Compare a plain loop with Duff's
device and you'll see what I mean.
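For readers who have not seen the comparison Dan alludes to, here is a
minimal, purely illustrative sketch (not code from the thread): the same
copy written as a plain loop and as a memcpy-style variant of Duff's
device (Tom Duff's original wrote to a single memory-mapped output
register rather than incrementing the destination).  The readability
cost of the hand-unrolled version is the point; whether it is actually
faster depends entirely on the compiler and machine.

/* straightforward version: one obvious loop */
void copy_plain(short *to, const short *from, int count)
{
    int i;

    for (i = 0; i < count; i++)
        to[i] = from[i];
}

/* Duff's device: the same loop unrolled 8 times, with the switch
 * jumping into the middle of the loop body to handle the leftover
 * count % 8 elements.  Assumes count > 0.
 */
void copy_duff(short *to, const short *from, int count)
{
    int n = (count + 7) / 8;    /* number of passes through the loop */

    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}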
>Who's doing anything strange and
>arcane?  I just made it easier on the compiler / computer.

You made it easier on your compiler/computer combination, but not on
mine.

>> >And how is one less readable than the other?
>>
>> Methinks one loop is easier to read than five.

>Sometimes, but other times I would rather read five small
>loops right in a row than one mammoth loop that I have to
>take in all at once.

In this particular case, test1 is easier to read.  It also happens to be
faster on some machines.

>In this particular case, I think they are both equally
>readable.

I don't.

>> >> 2. Another compiler, on another system, might generate faster code
>> >>    out of test1.  This is especially true for supercomputers, which
>> >>    have no cache memory (and where the micro-optimizations are done
>> >>    based on a completely different set of criteria) and where the
>> >>    cpu time is really expensive.
>> >
>> >Show me the computer where test1 comes out faster.
>>
>> Already done: my notebook.

>Yes, trivially faster.  But how about significantly faster?

There is no machine that I can think of where one of them would be
significantly faster than the other.  Do you have a counterexample?

>> >Or shouldn't we depend on the compiler to optimize this?
>>
>> Exactly, it's the compiler's job to optimize it, not mine.

>Well, the compiler didn't do a very good job, did it?

YOUR compiler didn't do a very good job.  By crippling your code for
your compiler you made it less efficient for other platforms.  Hence,
your example was a very bad one.

>And that's the whole point.  It is ivory tower naivete to
>think that compilers optimize everything perfectly every
>time.

It's equally naive to believe that code micro-optimized for one platform
will be faster on any other platform.

>Yes, you can just wave your hand and say, "well,
>obviously AIX sucks", but in my experience at least, ALL
>compilers suck in one way or another.

And adapting your code to one particular sucking compiler is a great
idea, if I understood your point right.

>> >> Let's see what happens on my 486DX33 box:
>> >>
>> >> So, it's 1.10 + 0.23 = 1.33 seconds of cpu time for test1 versus
>> >> 1.17 + 0.18 = 1.35 seconds for test2.
>>
>> Get a clue.  If test1 flushes the cache, test2 will flush it, as well.
>> Both implementations behave the same way WRT cache utilization: all
>> 5 x 50000 array elements are accessed in a single iteration of the
>> outer loop, hence the same thing (cache hit or cache miss) will happen
>> at the next iteration in both versions.  Cache is simply a non-issue
>> in your example (even if you intended it to be :-)

>- Sigh - I must really think about these posts for more than
>the two minutes per thread.  You are right, of course.  The
>cache is irrelevant to this example.  It's more likely that
>the first case makes better use of page locality.  It may
>also be that the compiler ran out of address registers with
>five arrays (or a combination of both).

I wasn't aware that the POWER architecture has "address registers".
Anyway, my 486 definitely ran out of registers, but this didn't prevent
test1 from being marginally faster.

>Of course, I could restructure the example to take advantage
>of the cache, and get even more improvement. :)

Or make test1 even faster on my notebook, with only 8k of cache :-)
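Neither poster says what such a restructuring would look like.  Purely
as an illustration (this is not code from the thread), one textbook
reading is cache blocking, or strip-mining: walk the arrays in chunks
small enough to stay resident in the cache, and run all ten passes over
a chunk before moving on.  The chunk size below is a guess that would
have to be tuned per machine (5 * BLOCK * sizeof(int) should fit in the
data cache), and whether it beats test1 or test2 would of course have to
be measured on each compiler/machine combination.

/* Illustrative sketch only; uses the same global arrays as the
 * quoted example.
 */
extern int a[50000], b[50000], c[50000], d[50000], e[50000];

#define N     50000
#define BLOCK 250    /* 5 * 250 * 4 bytes = 5000 bytes, fits an 8k cache */

void test3(void)
{
    int i, j, base;

    for (base = 0; base < N; base += BLOCK) {
        for (j = 0; j < 10; ++j) {
            /* the extra bound handles the last chunk if BLOCK
             * does not divide N evenly */
            for (i = base; i < base + BLOCK && i < N; ++i) {
                ++a[i]; ++b[i]; ++c[i]; ++d[i]; ++e[i];
            }
        }
    }
}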
>> >In any case, the code is identically
>> >readable either way IMO, and costs you nothing to implement
>> >the efficient way.
>>
>> You forgot to provide a VALID justification for your claim that test2
>> is the "efficient way".

>It's probably mostly the paging locality.

Both versions have excellent paging locality.  Read them again and tell
me which one is likely to produce more page faults than the other.

>Did you run your
>test under DOS or Windows?  It would be interesting to see
>if there was a difference.

I have better uses for my time than to use DOS and Windows.  But the
Linux/gcc combination proved your claim wrong.

>> >So it wasn't on one architecture, big deal.
>>
>> Can you prove that it will be on any other architecture?  According
>> to your line of argumentation, I could say: "so it was on one
>> architecture, big deal" :-)

>I would have to try it on other architectures to be sure, but
>can you come up with a scenario where the second would be
>significantly slower?

Can you come up with a scenario where the second would be significantly
faster?

>I would say that on the average paging
>architecture, memory locality tends to be quite important to
>performance.

And I have to repeat that both implementations have similar memory
locality.  Unless you can prove otherwise.

>> >Agreed; optimization shouldn't be to a certain platform, but
>> >optimizations can be made on a general basis.
>>
>> Right.  By selecting the proper algorithm.  Most other kinds of
>> optimizations will be specific to a certain machine, or, at best,
>> class of machines.  For example, an optimization which tries to
>> improve the cache hit rate might hurt on a supercomputer, which has
>> no cache but is adversely affected by certain memory access patterns.

>Well, if you're using a super computer, chances are you *will*
>optimize to that architecture, because super computer time
>tends to be expensive.  For general purpose software, the
>architectures are pretty much the same all around.  They all
>have caches, they all have paging.

Are you kidding or what?  Please explain the paging scheme of MSDOS, the
most popular platform in the world (unfortunately).  And cacheless
architectures are common at both ends of the spectrum.

>> When programming for a single platform/compiler combination, it makes
>> sense to perform micro-optimizations, _if they're rewarding enough_.
>> But then, if portability is not an issue, the critical code could be
>> written in assembly, as well.  When writing portable code in a HLL,
>> micro-optimizations are usually a waste of effort and programmer time.

>I think that most architectures in the real world are close
>enough that you can find commonality, such as inefficient
>recursion.

There are plenty of algorithms where recursion is the most efficient
implementation.  Just ask any Fortran programmer who had to implement
such an algorithm without using recursion.  Try to find a better example
next time.

Dan
--
Dan Pop
CERN, CN Division
Email: Dan.Pop@cern.ch
Mail:  CERN - PPE, Bat. 31 R-004, CH-1211 Geneve 23, Switzerland