Loop Optimisation [was: Overhead for type conversions?]

comp.lang.ada
 help / color / mirror / Atom feed

From: "Nick Roberts" <nickroberts@blueyonder.co.uk>
Subject: Loop Optimisation [was: Overhead for type conversions?]
Date: Sun, 27 Jul 2003 14:41:48 +0100
Date: 2003-07-27T14:41:48+01:00	[thread overview]
Message-ID: <bg0kn5$j97gf$1@ID-25716.news.uni-berlin.de> (raw)
In-Reply-To: pan.2003.07.26.23.39.45.556296@mail.utexas.edu

"Bobby D. Bryant" <bdbryant@mail.utexas.edu> wrote in message
news:pan.2003.07.26.23.39.45.556296@mail.utexas.edu...
> On Sat, 26 Jul 2003 12:01:43 -0400, Warren W. Gay VE3WWG wrote:
> ...
> > I once suggested something like this to Simon Wright regarding
> > some procedure/function calls within the Booch Components. He
> > did some profiling, and discovered that inlining that I suggested
> > actually worsened the performance slightly under Linux. I don't
> > know if he did more investigations along this line, but for
> > the examples that I suggested, it actually made things worse
> > (due to instruction caching reasons I expect).
> >
> > So inlining should probably be tested before the conclusion is
> > made. [...]

In order to illustrate this, perhaps bizarre-seeming, phenomenon, consider
the following loop:

   loop
      ...
      if C then
         P(...);
      else
         Q(...);
      end if;
      ...
   end loop;

Typical modern processors have a very limited (second level) instruction
cache (a few kilobytes), with limited (or no) associativity. Let's assume
that the code for this loop, where the call (in the Ada sense) of procedure
P is expanded to call instructions (in the machine code sense), all fits
into the instruction cache. Let's also assume that P will tend to be called
less frequently than Q.

If procedure P is inlined, the call to procedure P is replaced by the code
representing P's body; this could cause the size of the loop code to
increase so much that it causes the overall loop to become bigger than the
instruction cache. In this case, the slowing down caused by having to
retrieve some of the instructions in the loop from the first-level cache on
each iteration of the loop could outweigh the speeding up achieved by
eliminating the instructions associated with the machine call to and return
from P.

However, look at a possible skeleton assembly expansion of the loop:

L1:
   ...
   [test C, truth in flag x]
   JNx L2
   [call or expand P]
   JMP L3
L2:
   [Q]
L3:
  ...
L4:
  ;end of loop

JMP means 'jump'. JNx means 'jump if not x'. A clever compiler can (and
will!) change this expansion to the following:

L1:
   ...
   [test C, truth in flag x]
   Jx L2
   [Q]
L3:
  ...
L4:
  ;end of loop
   ...
L2:
   [call or expand P]
   JMP L3

In other words, it pushes the code associated with the less frequently
executed branch of the 'if' statement outside the loop, precisely so as to
avoid the cache overflow problem.

This is a classic form of loop optimisation in the literature. I'd be very
interested to know how many Ada compilers are capable of performing it in
practice.

The much-touted modern technique for determining which branch of an
if-then-else is least often executed is profiling. Personally, my preference
would be for the compiler to assume the 'else' part is more often executed
by default, and to provide a pragma to instruct the compiler otherwise.

I might propose a 'straw man' for such a pragma as follows:

----------

Syntax

   pragma Principal_Path;

Static Semantics

The effect defined by this standard applies only when at most one pragma
Principal_Path occurs within an if statement which contains more than one
sequence of statements. In all other cases, the effect of pragma
Principal_Path is implementation-defined.

Pragma Principal_Path advises the implementation that optimization is to be
based on the assumption that the sequence of statements in which it occurs
will be executed more frequently than any other sequences of statements in
the same if statement.

In the absence of any pragma Principal_Path within an if statement which
contains more than one sequence of statements, the implementation should
assume that the last sequence of statements in the if statement will be
executed more frequently than any of the other sequences of statements.

An implementation should always assume, for any if statement, that any
sequence of statements within the if statement will be executed less
frequently than the if statement itself.

Implementatation Permissions

An implementation may ignore pragma Principal_Path; in particular, it may
override the effect of this pragma (for example, on the basis of a profiling
analysis).

An implementation may provide more sophisticated ways for the programmer to
express relative path importance -- for example by ranking paths
numerically -- possibly by defining the effect of pragma Principal_Path in
places where this standard does not, or by defining parameters for pragma
Principal_Path, or by some other means. Such extensions should be
documented.

----------

I'd also suggest a pragma for the parallelisation of 'for' loops. Maybe in
another post.

--
Nick Roberts
Jabber: debater@charente.de [ICQ: 159718630]

     prev parent reply	other threads:[~2003-07-27 13:41 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2003-07-25  1:01 Overhead for type conversions? Bobby D. Bryant
2003-07-25  2:36 ` Robert I. Eachus
2003-07-25  7:12   ` Bobby D. Bryant
2003-07-25 15:34 ` Matthew Heaney
2003-07-25 18:28 ` Randy Brukardt
2003-07-26 15:54 ` Nick Roberts
2003-07-26 16:01   ` Warren W. Gay VE3WWG
2003-07-26 23:39     ` Bobby D. Bryant
2003-07-27 13:41       ` Nick Roberts [this message]

replies disabled

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox