From: eachus@spectre.mitre.org (Robert I. Eachus)
Subject: Re: C is 'better' than Ada because...
Date: 1996/07/30
References: <31e02c32.342948604@netline-fddi.jpl.nasa.gov>
Organization: The Mitre Corp., Bedford, MA.
Newsgroups: comp.lang.ada,comp.lang.c

In article <01bb7e2f$aeef4660$87ee6fce@timpent.airshields.com>
"Tim Behrendsen" writes:

 > The above are all worthwhile optimizations; I'm not claiming that
 > optimizers do nothing.  All I'm saying is that too often programmers
 > fall back on the lazy line, "Oh well, the compiler should take care
 > of that."  For example, if I have ...

 >     for (i = 0; i < 10000000; ++i)
 >         array[i] *= abs(values[i]);

 > Should I be *sure* this is going to be done efficiently, and use
 > pointers, or should I just let the compiler handle it?  I say that
 > a competent programmer should use pointers out of habit, because
 > it's just good design and practice.  And when the problem becomes
 > less trivial than the one above, the programmer will already be
 > using the proper techniques *the first time*, rather than having
 > to go back and fix the inevitable performance bottlenecks.

Let's take that "little" example and ask a few of the questions that a
"good" optimizer will consider:

1. Should I use two pointers or three?  The value of array[i] needs to
be fetched, then multiplied by abs(values[i]), then stored.  If I use
only two address registers, then on many architectures there will be a
pipeline stall.
On many RISC architectures I can put the increment of the pointer to
array before the RHS evaluation and store to pointer - 1, but other
architectures do not allow negative offsets, so you run the pointer
from address(array[0]) - 4 to address(array[10000000]) - 4.  (Or
whatever the length of the elements is; strength reduction and
constant folding are assumed.)  On some RISC hardware you can do
full-range register offsets, so you do (in a fairly obvious
intermediate code):

        LOAD (R1, address(values))
        LOAD (R2, address(array))
        LOAD (R3, 0)
        LOAD (R4, -4)
        ADD  (R4, R2)
        LOAD (R5, 40000000)
    L1: LOAD (R6, (R1)(R3))
        LOAD (R7, (R2)(R3))
        ADD  (R3, 4)
        ABS  (R6)
        MULT (R7, R6)
        BNE  (R3, R5, L1)
        STORE(R7, (R4)(R3))
        ...

Or some such.  (The STORE sits in the branch delay slot; storing
through R4, which holds address(array) - 4, after R3 has already been
incremented is the "store to pointer - 1" mentioned above.)  Notice
that this code corresponds to source code that does NOT use pointers,
and it takes fairly sophisticated analysis to get here from code that
uses pointers and does two increments.

2. How many times should I unroll the loop?

3. Can I get better results by prefetching array elements so the
cache is always full?

4. If this machine doesn't have a write-through cache or a deep write
pipe, can I design the code so that two array elements are generated
in a register pair and written together?  (Just because the
double-precision write was intended for double-precision floating
results doesn't mean you can't use it elsewhere.)

5. If this machine has more than one integer arithmetic pipeline,
should I use one for offsets and the other for arithmetic, or (more
likely) evaluate two elements of the array at once?

Now, how many of those optimizations did you think of?  And how many
do you want to embed in your source code?  (I hope the second answer
is none.)

Let me tell you a story, shortened for this forum.  I got a call that
our compiler was generating "junk" code for an Ada subroutine.  I
looked at the source, which went on for over a page, and suggested an
equivalent which took one line.  The guy promised to try that and
call me back.

"It generates the same three junk instructions!" he complained.
"Oh?  What's wrong with them?  They look fine to me."

"You can't do that in that few instructions!"

"Let's see, the first instruction predecrements the stack
pointer...why is it doing that?  Oh, so it can do an indirect fetch
through the second parameter.  The second instruction adds the first
parameter; why no test for overflow?  Ah, analysis must show that it
can never happen?  Right.  Now, the third instruction stores the
result where it got the first operand, carefully postincrementing the
stack pointer so it will be right at the return.  What's wrong with
that?"

"Do you really think the compiler should be messing with the stack
pointer like that?"

"Why not?  You can't get a trap where it is wrong."

"But how does the compiler know that?"

"It does a global call tree, and attaches known bounds to every
parameter of procedures not visible outside the unit.  If you put a
declaration in the package spec, you should get the code you expect."

"Oh."

Now, I thought that a peephole optimizer which didn't know what a
stack pointer was, just when its value had to be trustworthy, was
cool stuff...in 1984.  There are lots of better optimizers and code
generators around now.  For almost any chip of interest (PowerPC,
Pentium, Pentium Pro, 680x0, SPARC, HPPA, etc.), the optimizer is
capable of doing incredible things if it knows the specific target
version.  You should never want to write code that finely tuned to
the hardware, but a good optimizer can gain 50% or more in
performance by knowing the specific chip model.

A horrible example, but very true: I've worked with compilers which
generated different calling sequences for the 68020, 68030, and
68040.  All were link-time compatible, but the differences in clock
counts for certain instructions meant that the best sequence was
different on different processors.  And, if you care, the 68040
sequence looked the dumbest, because it was dominated by memory
fetches, so fewer instructions meant fewer fetches: it actually used
the CALLM instruction.
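To make the loop forms under debate concrete, here is a minimal C
sketch (array sizes and function names are mine, purely illustrative):
the quoted index form, the "pointers out of habit" rewrite, and a
hand-unrolled variant of the sort a good optimizer produces on its
own.  A modern compiler will typically turn all three into much the
same machine code, which is rather the point.

```c
#include <stdlib.h>   /* abs */

/* Index form: exactly what the quoted example says. */
void scale_indexed(int *array, const int *values, int n)
{
    for (int i = 0; i < n; ++i)
        array[i] *= abs(values[i]);
}

/* Two-pointer form: the "use pointers out of habit" rewrite. */
void scale_pointers(int *array, const int *values, int n)
{
    const int *v = values, *end = values + n;
    int *a = array;
    while (v < end)
        *a++ *= abs(*v++);
}

/* Hand-unrolled four ways (question 2), with a cleanup loop for the
   leftover elements.  This is tuning that belongs in the optimizer:
   the best unroll factor differs from one chip model to the next. */
void scale_unrolled(int *array, const int *values, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        array[i]     *= abs(values[i]);
        array[i + 1] *= abs(values[i + 1]);
        array[i + 2] *= abs(values[i + 2]);
        array[i + 3] *= abs(values[i + 3]);
    }
    for (; i < n; ++i)
        array[i] *= abs(values[i]);
}
```

All three compute the same result; the difference is only in how much
machine-specific cleverness has been frozen into the source.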
--
Robert I. Eachus

with Standard_Disclaimer;
use Standard_Disclaimer;
function Message (Text : in Clever_Ideas) return Better_Ideas is...