From: "Robert I. Eachus"
Newsgroups: comp.lang.ada
Subject: Re: Ideas for Ada 200X
Date: Fri, 06 Jun 2003 01:59:10 GMT

Stephen Leake wrote:

> Just out of curiosity, why is turning off checks a lose for Float?
> How do checks make the Float code faster?

Let me give you the thirty-thousand-foot view, then get down and dirty.

Every compiler-generated check has two effects. The first is that performing the check costs execution resources. The second is that the compiler remembers the state of a check once it is made, and this can simplify other code, including other (non-suppressed) checks. The net cost of checking is the difference of those two quantities, so it can be positive or negative. It is also worth noting that suppressing only some checks is often the worst of both worlds: suppressing a check outside a loop can force another check inside the loop to be made once per iteration.

There is a rule, honored in spirit rather than for legalistic reasons, which says that Ada programs don't page fault. In practice this means that even if you suppress bounds checking, indexed writes won't overwrite areas of memory that are not data of the Ada program. For the two-dimensional array in this program, the compiler should still check that writes fall within the object. Bounds checks on reads need not be done, and for a write to Temp(I,J), I and J will not be checked against their index ranges--only that the computed offset is within the object.

So Suppress(All_Checks) is not required to actually suppress all checks. It is guidance to the compiler that all but the most minimal checks should be suppressed. Could you ask your vendor to go ahead and remove some of these minimal checks? Sure--although some can't be removed, since the hardware performs them automatically: writing to a read-only segment, or on many systems executing an instruction at an odd byte address. The rest can be done away with.
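Here is a minimal sketch of the kind of loop I have been describing. This is my own reconstruction, not the actual test program; the names and the 300 x 300 size are invented, kept small so the objects fit comfortably on the stack:

   with Ada.Text_IO;
   procedure Sum_Demo is
      pragma Suppress (All_Checks);  --  guidance to the compiler, not a guarantee
      N : constant := 300;
      type Matrix is array (1 .. N, 1 .. N) of Float;
      A    : constant Matrix := (others => (others => 1.0));
      B    : constant Matrix := (others => (others => 2.0));
      Temp : Matrix;
   begin
      for I in Temp'Range (1) loop
         for J in Temp'Range (2) loop
            --  I and J come straight from the index ranges, so the
            --  reads need no check at all.  For the write to
            --  Temp (I, J) a compiler may still verify that the
            --  computed offset lies inside the object, so a wild
            --  store can't land outside the program's data.
            Temp (I, J) := A (I, J) + B (I, J);
         end loop;
      end loop;
      Ada.Text_IO.Put_Line (Float'Image (Temp (N, N)));
   end Sum_Demo;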
But the compiler vendor makes a judgment call on what pragma Suppress(All_Checks) should do, and they tend to get it pretty close to some optimal definition of right.

Now to get down and dirty. Most modern CPU chips have multiple execution units and are "superscalar," meaning that more than one assembly language instruction can be dispatched to the execution pipes in a single clock cycle. I'll use the Athlon as the example here, because that is the CPU I ran the test on.

In a single clock cycle, the CPU grabs up to 64 bits of instructions (eight bytes) and decodes at most three instructions from the bits it has. (Some of those bits could be carryovers from the previous clock cycle.) The x86 ISA has some one-byte instructions, and worst-case instructions can be more than eight bytes long. These instructions are then decoded into micro-ops. Most x86 instructions decode into one or two micro-ops, but what are called vector path instructions are special: they have to be the only instruction decoded on a clock cycle, and they can translate into an unlimited number of micro-ops--a block copy, for example.

The processor then dispatches the micro-ops either to the integer reordering buffers or to the floating-point reordering buffer. There are three queues of micro-ops waiting for dispatch to the three integer and three logical pipes. These can be dispatched out of order, but the instructions stay with the pair of execution units they have been assigned to. In addition, there is a floating-point instruction reordering buffer feeding three floating-point execution units: a floating load/store unit, which can also execute most MMX instructions; a floating multiply/divide unit; and a floating add/subtract unit.

So for our little test program, all the floating-point arithmetic instructions had to be done by the floating add/subtract pipe, but that was not the computational bottleneck. For every floating add there were at least two floating loads and one floating store. But those are still not the bottleneck: three load/stores per add comes to 3,000,000 instructions, and executing those on a 700 MHz CPU takes about 4 milliseconds.

So where did all the "extra" effort go? Some went into the integer and logical micro-ops involved in the loop and the address computations. But even at three integer and three logical operations per floating-point load or store, that wouldn't take any more CPU time. The answer is in the memory accesses. Again, we need three million 32-bit memory accesses. This computer has 100 MHz SDRAM DIMMs. Figure that even at 100% efficiency, the main memory accesses will take 15 milliseconds, and of course 100% efficiency is not going to happen, though it should be possible to get pretty close. Once my regular CPU is back, the memory will be more than twice as fast (PC2100 DDR) and the CPU will be just over twice as fast, so the memory-to-CPU clock ratio will stay about the same.

As you can see, the "overhead" of doing the bounds and overflow checking got lost way back. On computationally heavy problems, the place to spend your time is in optimizing the memory accesses so that as few as possible occur, and as many as possible are satisfied from the L1 or L2 cache. The two sketches below spell out the arithmetic and the cache point.
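First, the back-of-the-envelope arithmetic. The numbers are mine, inferred from the figures above: one million floating adds, three load/stores per add, and the fact that a PC100 DIMM has a 64-bit data path (so it moves two 32-bit values per bus cycle):

   load/stores:  3 per add x 1,000,000 adds  =  3,000,000
   CPU time:     3,000,000 cycles at 700 MHz ~  4.3 ms
   bus traffic:  3,000,000 x 32 bits         =  1,500,000 bus cycles
   memory time:  1,500,000 cycles at 100 MHz = 15 ms

So even a perfect memory system spends several times as long moving the data as the CPU spends issuing the loads and stores, which is why the cost of the checks disappears into the noise.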
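Second, the cache point, again as my own sketch with a caveat: the Ada standard leaves array representation to the implementation, though GNAT and most compilers store multi-dimensional arrays in row-major order. With the declarations from Sum_Demo above, the order of the loop nest decides whether memory is walked sequentially or at a full-row stride:

   --  Good: J varies fastest, matching row-major storage, so each
   --  cache line brought in from memory serves many elements.
   for I in Temp'Range (1) loop
      for J in Temp'Range (2) loop
         Temp (I, J) := A (I, J) + B (I, J);
      end loop;
   end loop;

   --  Bad: I varies fastest, so consecutive accesses are N * 4 bytes
   --  apart and each one can miss the cache, multiplying exactly the
   --  memory traffic that dominates the timing above.
   for J in Temp'Range (2) loop
      for I in Temp'Range (1) loop
         Temp (I, J) := A (I, J) + B (I, J);
      end loop;
   end loop;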