From: "Robert I. Eachus"
Newsgroups: comp.lang.ada
Subject: Re: Ada and vectorization
Date: Mon, 17 Jun 2002 23:47:54 GMT

Guillaume Foliard wrote:

> I have started to learn how to use Intel's SSE instruction set in Ada
> programs with inline assembly. While reading the Intel documentation (1)
> I wondered whether Ada could provide a clean way of vectorization through
> its strongly typed approach. Could it be sensible, for the next Ada
> revision, to create some new attributes for array types to explicitly
> hint to the compiler that we want SIMD instructions used? Language
> lawyers' comments are definitely welcome. As SIMD in modern
> general-purpose processors is widely available nowadays (SSE, SSE2,
> AltiVec, etc.), IMHO it would be a mistake for Ada to ignore the
> performance benefit this could bring.

Let me answer this with two different hats on.

First, the language-lawyer hat: you have to ask what restrictions imposed by
the language prevent the use of these features, then look at how to either
relax those restrictions or create language features that explicitly bypass
them.  This has been done in Ada before.  For example, Ada permits
non-standard numeric types, to allow for things like a floating-point type
with inaccurate divides, integer types that cannot be used as array indices,
and so on.  If you don't need accuracy, you can compile and execute programs
with the strict mode of the Numerics Annex turned off.  I am not quite that
crazy, but it could make sense for 3D display code. ;-)  See also 11.6,
Exceptions and Optimization (and that section can lead to a really long
thread...).

So if it takes something special to use a SIMD instruction set, the language
already allows it.  In practice, all of the existing interesting SIMD
extensions can be mapped onto standard integer, boolean, float, etc. types
(a rough sketch of what that looks like with GNAT follows below).

Now from a practical point of view: there are two problems with designing
language extensions that map to specific hardware.  The first is that
software and language lifetimes are much greater than hardware lifetimes.
For example, you mention AMD's 3DNow!  As it happens, there are three
versions of 3DNow!: the original version in the K6-2, the extended version
in the original Athlons, and the version in the Athlon XP (and Morgan Duron)
chips that is a superset of Intel's SSE.
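
To make that first point concrete, here is a rough, untested sketch of the
kind of thing Guillaume is already doing today: an ordinary four-element
array of Float, given a 16-byte alignment clause, fed to an SSE addps
through GNAT's machine-code insertions (System.Machine_Code.Asm).  The
names (SSE_Demo, Float_Vec, Add) are mine, and the constraint strings and
clobber list may need adjusting for a particular compiler version; the only
point is that no new array attributes are needed to describe the operands.

with System.Machine_Code; use System.Machine_Code;

procedure SSE_Demo is

   type Float_Vec is array (0 .. 3) of Float;
   for Float_Vec'Alignment use 16;  --  movaps requires 16-byte alignment

   --  Packed single-precision add of two Float_Vec values using SSE.
   procedure Add (A, B : in Float_Vec; R : out Float_Vec) is
   begin
      Asm ("movaps %1, %%xmm0"  & ASCII.LF & ASCII.HT &
           "addps  %2, %%xmm0"  & ASCII.LF & ASCII.HT &
           "movaps %%xmm0, %0",
           Outputs  => Float_Vec'Asm_Output ("=m", R),
           Inputs   => (Float_Vec'Asm_Input ("m", A),
                        Float_Vec'Asm_Input ("m", B)),
           Clobber  => "xmm0",
           Volatile => True);
   end Add;

   X : constant Float_Vec := (1.0, 2.0, 3.0, 4.0);
   Y : constant Float_Vec := (10.0, 20.0, 30.0, 40.0);
   Z : Float_Vec;

begin
   Add (X, Y, Z);  --  Z should now hold (11.0, 22.0, 33.0, 44.0)
end SSE_Demo;
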
The Intel situation is a little clearer, but even there, if you are doing a
decent (portable) programming job you have to deal with MMX-only chips,
chips with SSE, and chips with SSE2.  It is much nicer to use the right
architecture switch and have the compiler produce efficient code for your
target architecture.  (If you are really doing a good job, you will isolate
all the SIMD-dependent code into a few DLLs and have the installer choose
the correct version of each for the current hardware; a sketch of that
structure appears at the end of this post.)

The second practical issue is much nastier.  Two implementations of the very
same ISA can have very different performance behavior.  Worse, two otherwise
exclusive features can have nasty interactions in a given implementation.
Let me take a simple example, MMX and 3DNow!  Athlons allow integers and
floating-point values to share architectural registers.  Given the large
floating-point register rename files, this is actually a nice feature.  But
if mode bits get reset, the programmer usually cares which mode bits are
used for which operations.  The solution is to generate an SFENCE
instruction, which ensures that the view of hardware registers and memory is
globally consistent, even for things that are otherwise weakly ordered.
This instruction can have almost no latency--or require thousands of clock
cycles in the worst cases.  (For example, a write may cause a TLB miss, and
the part of the page table that needs to be read may not be in the L1 or L2
cache.)

So what code should a compiler generate?  The usual solution is to consider
both the average execution time and the variance when choosing between two
code sequences.  Would you rather the compiler used sequence A, with a
minimum of 107 clocks and a maximum of 192, or sequence B, with a minimum of
100 clocks and a worst case of 1000?  This often results in not using MMX
registers or SSE code where the potential savings is only a few percent.  If
the user forces the compiler to use SSE in all cases, the horrible sequences
will be in there along with the good ones.

One last horrible problem goes by the innocent-sounding name of
store-to-load forwarding.  On modern processors, the actual store from a
register to either cache or main memory can take place hundreds of clock
cycles after the beginning of the move instruction.  Out-of-order processors
get around this by keeping track of pending writes in rename registers; if a
load instruction for that data is encountered, the load is turned into a
no-op and the register is renamed as the target of the load.  But what if
only part of the load data is coming from the store and the rest is being
read from cache or main memory?  Most chips throw up their hands and make
the load instruction dependent on the store instruction being retired.  This
is a nasty cost you don't want to run into.  (There are also other ways to
run into store-to-load forwarding problems, but that is another topic.)
What if you have a 32-bit integer in an integer register and want to combine
it into a 64-bit or 128-bit SSE operand?  Uh-oh!  Much better to avoid the
store-to-load restrictions and the SSE operations.  Again, this is something
where you expect (hope?) the compiler will get it right, and forcing the use
of SSE can result in very suboptimal code.
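
As for isolating the SIMD-dependent code, here is one (hypothetical,
GNAT-flavored) way the dispatch can be structured in Ada, so that a portable
version and an SSE version built as separate libraries or DLLs can be
swapped without touching the callers.  All the names (Vector_Math,
Float_Vec, Add_Proc, Add, Add_Plain, Add_SSE) are mine, and the CPU check is
deliberately left as a comment.

package Vector_Math is

   pragma Elaborate_Body;  --  the body binds Add during elaboration

   type Float_Vec is array (0 .. 3) of Float;

   type Add_Proc is access procedure (A, B : in Float_Vec; R : out Float_Vec);

   --  Single dispatch point: callers always go through Vector_Math.Add and
   --  never see which implementation they got.
   Add : Add_Proc;

end Vector_Math;

package body Vector_Math is

   --  Portable fallback that works on any hardware.
   procedure Add_Plain (A, B : in Float_Vec; R : out Float_Vec) is
   begin
      for I in Float_Vec'Range loop
         R (I) := A (I) + B (I);
      end loop;
   end Add_Plain;

begin
   --  In a real program this choice would test the CPU (via CPUID) or
   --  simply depend on which DLL the installer dropped in, and would pick
   --  something like Add_SSE'Access instead.
   Add := Add_Plain'Access;
end Vector_Math;

Client code just writes Vector_Math.Add (X, Y, Z); only the package body
(or the library behind it) knows whether SSE is being used.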