From: "Robert I. Eachus"
Newsgroups: comp.lang.ada
Subject: Re: Ideas for Ada 200X
Date: Fri, 06 Jun 2003 01:59:10 GMT

Stephen Leake wrote:

> Just out of curiosity, why is turning off checks a lose for Float?
> How do checks make the Float code faster?

Let me give you the thirty-thousand-foot view, then get down and dirty.

Every compiler-generated check has two effects. The first is that performing the check costs execution resources. The second is that the compiler remembers the state of a check once it is made, and this can simplify other code, including other (non-suppressed) checks. The net cost of checking is the difference of those two quantities, so it can be positive or negative. It is also worth noting that suppressing only some checks is often the worst of both worlds: suppressing a check outside a loop can force another check inside the loop to be made once per iteration.

There is a rule, honored in spirit rather than for legalistic reasons, which says that Ada programs don't page fault. In practice this means that even if you suppress bounds checking, indexed writes won't overwrite areas of memory that are not data of the Ada program. For the two-dimensional array in this program, the compiler should still check that writes fall within the object. Bounds checks on reads need not be done, and for a write to Temp(I,J), I and J will not be checked against their index ranges--only that the computed offset is within the object.

So Suppress(All_Checks) is not required to actually suppress all checks. It is guidance to the compiler that all but the most minimal checks should be suppressed. Could you ask your vendor to go ahead and remove some of these minimal checks? Sure--although some can't be removed, since the hardware performs them automatically: writing to a read-only segment, or on many systems executing an instruction at an odd byte address. The rest can be done away with.
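Here is a minimal sketch of the kind of loop I have been describing. This is my own reconstruction, not the actual test program; the names and the 300 x 300 size are invented, kept small so the objects fit comfortably on the stack:

   with Ada.Text_IO;
   procedure Sum_Demo is
      pragma Suppress (All_Checks);  --  guidance to the compiler, not a guarantee
      N : constant := 300;
      type Matrix is array (1 .. N, 1 .. N) of Float;
      A    : constant Matrix := (others => (others => 1.0));
      B    : constant Matrix := (others => (others => 2.0));
      Temp : Matrix;
   begin
      for I in Temp'Range (1) loop
         for J in Temp'Range (2) loop
            --  I and J come straight from the index ranges, so the
            --  reads need no check at all.  For the write to
            --  Temp (I, J) a compiler may still verify that the
            --  computed offset lies inside the object, so a wild
            --  store can't land outside the program's data.
            Temp (I, J) := A (I, J) + B (I, J);
         end loop;
      end loop;
      Ada.Text_IO.Put_Line (Float'Image (Temp (N, N)));
   end Sum_Demo;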
But the compiler vendor makes a judgment call on what pragma Suppress(All_Checks) should do, and they tend to get it pretty close to some optimal definition of right.

Now to get down and dirty. Most modern CPU chips have multiple execution units and are "superscalar," meaning that more than one assembly language instruction can be dispatched to the execution pipes in a single clock cycle. I'll use the Athlon as the example here, because that is the CPU I ran the test on.

In a single clock cycle, the CPU grabs up to 64 bits of instructions (eight bytes) and decodes at most three instructions from the bits it has. (Some of those bits could be carryovers from the previous clock cycle.) The x86 ISA has some one-byte instructions, and worst-case instructions can be more than eight bytes long. These instructions are then decoded into micro-ops. Most x86 instructions decode into one or two micro-ops, but what are called vector path instructions are special: they have to be the only instruction decoded on a clock cycle, and they can translate into an unlimited number of micro-ops--a block copy, for example.

The processor then dispatches the micro-ops either to the integer reordering buffers or to the floating-point reordering buffer. There are three queues of micro-ops waiting for dispatch to the three integer and three logical pipes. These can be dispatched out of order, but the instructions stay with the pair of execution units they have been assigned to. In addition, there is a floating-point instruction reordering buffer feeding three floating-point execution units: a floating load/store unit, which can also execute most MMX instructions; a floating multiply/divide unit; and a floating add/subtract unit.

So for our little test program, all the floating-point arithmetic instructions had to be done by the floating add/subtract pipe, but that was not the computational bottleneck. For every floating add there were at least two floating loads and one floating store. But those are still not the bottleneck: three load/stores per add comes to 3,000,000 instructions, and executing those on a 700 MHz CPU takes about 4 milliseconds.

So where did all the "extra" effort go? Some went into the integer and logical micro-ops involved in the loop and the address computations. But even at three integer and three logical operations per floating-point load or store, that wouldn't take any more CPU time. The answer is in the memory accesses. Again, we need three million 32-bit memory accesses. This computer has 100 MHz SDRAM DIMMs. Figure that even at 100% efficiency, the main memory accesses will take 15 milliseconds, and of course 100% efficiency is not going to happen, though it should be possible to get pretty close. Once my regular CPU is back, the memory will be more than twice as fast (PC2100 DDR) and the CPU will be just over twice as fast, so the memory-to-CPU clock ratio will stay about the same.

As you can see, the "overhead" of doing the bounds and overflow checking got lost way back. On computationally heavy problems, the place to spend your time is in optimizing the memory accesses so that as few as possible occur, and as many as possible are satisfied from the L1 or L2 cache. The two sketches below spell out the arithmetic and the cache point.
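First, the back-of-the-envelope arithmetic. The numbers are mine, inferred from the figures above: one million floating adds, three load/stores per add, and the fact that a PC100 DIMM has a 64-bit data path (so it moves two 32-bit values per bus cycle):

   load/stores:  3 per add x 1,000,000 adds  =  3,000,000
   CPU time:     3,000,000 cycles at 700 MHz ~  4.3 ms
   bus traffic:  3,000,000 x 32 bits         =  1,500,000 bus cycles
   memory time:  1,500,000 cycles at 100 MHz = 15 ms

So even a perfect memory system spends several times as long moving the data as the CPU spends issuing the loads and stores, which is why the cost of the checks disappears into the noise.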
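Second, the cache point, again as my own sketch with a caveat: the Ada standard leaves array representation to the implementation, though GNAT and most compilers store multi-dimensional arrays in row-major order. With the declarations from Sum_Demo above, the order of the loop nest decides whether memory is walked sequentially or at a full-row stride:

   --  Good: J varies fastest, matching row-major storage, so each
   --  cache line brought in from memory serves many elements.
   for I in Temp'Range (1) loop
      for J in Temp'Range (2) loop
         Temp (I, J) := A (I, J) + B (I, J);
      end loop;
   end loop;

   --  Bad: I varies fastest, so consecutive accesses are N * 4 bytes
   --  apart and each one can miss the cache, multiplying exactly the
   --  memory traffic that dominates the timing above.
   for J in Temp'Range (2) loop
      for I in Temp'Range (1) loop
         Temp (I, J) := A (I, J) + B (I, J);
      end loop;
   end loop;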