From: eachus@spectre.mitre.org (Robert I. Eachus)
Subject: Re: C is 'better' than Ada because...
Date: 1996/07/30
References: <31e02c32.342948604@netline-fddi.jpl.nasa.gov>
Organization: The Mitre Corp., Bedford, MA.
Newsgroups: comp.lang.ada,comp.lang.c

In article <01bb7e2f$aeef4660$87ee6fce@timpent.airshields.com>
"Tim Behrendsen" writes:

 > The above are all worthwhile optimizations; I'm not claiming that
 > optimizers do nothing.  All I'm saying is that too often programmers
 > fall back on the lazy line, "Oh well, the compiler should take care
 > of that."  For example, if I have ...

 >     for (i = 0; i < 10000000; ++i)
 >         array[i] *= abs(values[i]);

 > Should I be *sure* this is going to be done efficiently, and use
 > pointers, or should I just let the compiler handle it?  I say that
 > a competent programmer should use pointers out of habit, because
 > it's just good design and practice.  And when the problem becomes
 > less trivial than the one above, the programmer will already be
 > using the proper techniques *the first time*, rather than having
 > to go back and fix the inevitable performance bottlenecks.

Let's take that "little" example and ask a few of the questions that a
"good" optimizer will consider:

1. Should I use two pointers or three?  The value of array[i] needs to
be fetched, then multiplied by abs(values[i]), then stored.  If I use
only two address registers, then on many architectures there will be a
pipeline stall.
On many RISC architectures I can put the increment of the pointer to
array before the RHS evaluation and store to pointer - 1, but other
architectures do not allow negative offsets, so you run the pointer
from address(array[0]) - 4 to address(array[10000000]) - 4.  (Or
whatever the length of the elements is; strength reduction and
constant folding are assumed.)  On some RISC hardware you can do
full-range register offsets, so you do (in a fairly obvious
intermediate code):

        LOAD (R1, address(values))
        LOAD (R2, address(array))
        LOAD (R3, 0)
        LOAD (R4, -4)
        ADD  (R4, R2)
        LOAD (R5, 40000000)
    L1: LOAD (R6, (R1)(R3))
        LOAD (R7, (R2)(R3))
        ADD  (R3, 4)
        ABS  (R6)
        MULT (R7, R6)
        BNE  (R3, R5, L1)
        STORE(R7, (R4)(R3))
        ...

Or some such.  (The STORE sits in the branch delay slot; storing
through R4, which holds address(array) - 4, after R3 has already been
incremented is the "store to pointer - 1" mentioned above.)  Notice
that this code corresponds to source code that does NOT use pointers,
and it takes fairly sophisticated analysis to get here from code that
uses pointers and does two increments.

2. How many times should I unroll the loop?

3. Can I get better results by prefetching array elements so the
cache is always full?

4. If this machine doesn't have a write-through cache or a deep write
pipe, can I design the code so that two array elements are generated
in a register pair and written together?  (Just because the
double-precision write was intended for double-precision floating
results doesn't mean you can't use it elsewhere.)

5. If this machine has more than one integer arithmetic pipeline,
should I use one for offsets and the other for arithmetic, or (more
likely) evaluate two elements of the array at once?

Now, how many of those optimizations did you think of?  And how many
do you want to embed in your source code?  (I hope the second answer
is none.)

Let me tell you a story, shortened for this forum.  I got a call that
our compiler was generating "junk" code for an Ada subroutine.  I
looked at the source, which went on for over a page, and suggested an
equivalent which took one line.  The guy promised to try that and
call me back.

"It generates the same three junk instructions!" he complained.
"Oh?  What's wrong with them?  They look fine to me."

"You can't do that in that few instructions!"

"Let's see, the first instruction predecrements the stack
pointer...why is it doing that?  Oh, so it can do an indirect fetch
through the second parameter.  The second instruction adds the first
parameter; why no test for overflow?  Ah, analysis must show that it
can never happen?  Right.  Now, the third instruction stores the
result where it got the first operand, carefully postincrementing the
stack pointer so it will be right at the return.  What's wrong with
that?"

"Do you really think the compiler should be messing with the stack
pointer like that?"

"Why not?  You can't get a trap where it is wrong."

"But how does the compiler know that?"

"It does a global call tree, and attaches known bounds to every
parameter of procedures not visible outside the unit.  If you put a
declaration in the package spec, you should get the code you expect."

"Oh."

Now, I thought that a peephole optimizer which didn't know what a
stack pointer was, just when its value had to be trustworthy, was
cool stuff...in 1984.  There are lots of better optimizers and code
generators around now.  For almost any chip of interest (PowerPC,
Pentium, Pentium Pro, 680x0, SPARC, HPPA, etc.), the optimizer is
capable of doing incredible things if it knows the specific target
version.  You should never want to write code that finely tuned to
the hardware, but a good optimizer can gain 50% or more in
performance by knowing the specific chip model.

A horrible example, but very true: I've worked with compilers which
generated different calling sequences for the 68020, 68030, and
68040.  All were link-time compatible, but the differences in clock
counts for certain instructions meant that the best sequence was
different on different processors.  And, if you care, the 68040
sequence looked the dumbest, because it was dominated by memory
fetches, so fewer instructions meant fewer fetches: it actually used
the CALLM instruction.
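To make the loop forms under debate concrete, here is a minimal C
sketch (array sizes and function names are mine, purely illustrative):
the quoted index form, the "pointers out of habit" rewrite, and a
hand-unrolled variant of the sort a good optimizer produces on its
own.  A modern compiler will typically turn all three into much the
same machine code, which is rather the point.

```c
#include <stdlib.h>   /* abs */

/* Index form: exactly what the quoted example says. */
void scale_indexed(int *array, const int *values, int n)
{
    for (int i = 0; i < n; ++i)
        array[i] *= abs(values[i]);
}

/* Two-pointer form: the "use pointers out of habit" rewrite. */
void scale_pointers(int *array, const int *values, int n)
{
    const int *v = values, *end = values + n;
    int *a = array;
    while (v < end)
        *a++ *= abs(*v++);
}

/* Hand-unrolled four ways (question 2), with a cleanup loop for the
   leftover elements.  This is tuning that belongs in the optimizer:
   the best unroll factor differs from one chip model to the next. */
void scale_unrolled(int *array, const int *values, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        array[i]     *= abs(values[i]);
        array[i + 1] *= abs(values[i + 1]);
        array[i + 2] *= abs(values[i + 2]);
        array[i + 3] *= abs(values[i + 3]);
    }
    for (; i < n; ++i)
        array[i] *= abs(values[i]);
}
```

All three compute the same result; the difference is only in how much
machine-specific cleverness has been frozen into the source.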
--
Robert I. Eachus

with Standard_Disclaimer;
use Standard_Disclaimer;
function Message (Text : in Clever_Ideas) return Better_Ideas is...