From: "Robert I. Eachus"
Newsgroups: comp.lang.ada
Subject: Re: Ada and vectorization
Date: Mon, 17 Jun 2002 23:47:54 GMT

Guillaume Foliard wrote:

> I have started to learn how to use Intel's SSE instruction set in Ada
> programs with inline assembly. While reading the Intel documentation (1)
> I wondered whether Ada could provide a clean way of vectorization through
> its strongly typed approach. Could it be sensible, for the next Ada
> revision, to create some new attributes for array types to explicitly
> hint to the compiler that we want SIMD instructions used? Language
> lawyers' comments are definitely welcome. As SIMD in modern
> general-purpose processors is widely available nowadays (SSE, SSE2,
> AltiVec, etc.), IMHO it would be a mistake for Ada to ignore the
> performance benefit this could bring.

Let me answer this with two different hats on.

First, the language-lawyer hat: you have to ask what restrictions imposed by
the language prevent the use of these features, then look at how to either
relax those restrictions or create language features that explicitly bypass
them.  This has been done in Ada before.  For example, Ada permits
non-standard numeric types, to allow for things like a floating-point type
with inaccurate divides, integer types that cannot be used as array indices,
and so on.  If you don't need accuracy, you can compile and execute programs
with the strict mode of the Numerics Annex turned off.  I am not quite that
crazy, but it could make sense for 3D display code. ;-)  See also 11.6,
Exceptions and Optimization (and that section can lead to a really long
thread...).

So if it takes something special to use a SIMD instruction set, the language
already allows it.  In practice, all of the existing interesting SIMD
extensions can be mapped onto standard integer, boolean, float, etc. types
(a rough sketch of what that looks like with GNAT follows below).

Now from a practical point of view: there are two problems with designing
language extensions that map to specific hardware.  The first is that
software and language lifetimes are much greater than hardware lifetimes.
For example, you mention AMD's 3DNow!  As it happens, there are three
versions of 3DNow!: the original version in the K6-2, the extended version
in the original Athlons, and the version in the Athlon XP (and Morgan Duron)
chips that is a superset of Intel's SSE.
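
To make that first point concrete, here is a rough, untested sketch of the
kind of thing Guillaume is already doing today: an ordinary four-element
array of Float, given a 16-byte alignment clause, fed to an SSE addps
through GNAT's machine-code insertions (System.Machine_Code.Asm).  The
names (SSE_Demo, Float_Vec, Add) are mine, and the constraint strings and
clobber list may need adjusting for a particular compiler version; the only
point is that no new array attributes are needed to describe the operands.

with System.Machine_Code; use System.Machine_Code;

procedure SSE_Demo is

   type Float_Vec is array (0 .. 3) of Float;
   for Float_Vec'Alignment use 16;  --  movaps requires 16-byte alignment

   --  Packed single-precision add of two Float_Vec values using SSE.
   procedure Add (A, B : in Float_Vec; R : out Float_Vec) is
   begin
      Asm ("movaps %1, %%xmm0"  & ASCII.LF & ASCII.HT &
           "addps  %2, %%xmm0"  & ASCII.LF & ASCII.HT &
           "movaps %%xmm0, %0",
           Outputs  => Float_Vec'Asm_Output ("=m", R),
           Inputs   => (Float_Vec'Asm_Input ("m", A),
                        Float_Vec'Asm_Input ("m", B)),
           Clobber  => "xmm0",
           Volatile => True);
   end Add;

   X : constant Float_Vec := (1.0, 2.0, 3.0, 4.0);
   Y : constant Float_Vec := (10.0, 20.0, 30.0, 40.0);
   Z : Float_Vec;

begin
   Add (X, Y, Z);  --  Z should now hold (11.0, 22.0, 33.0, 44.0)
end SSE_Demo;
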
The Intel situation is a little clearer, but even there, if you are doing a
decent (portable) programming job you have to deal with MMX-only chips,
chips with SSE, and chips with SSE2.  It is much nicer to use the right
architecture switch and have the compiler produce efficient code for your
target architecture.  (If you are really doing a good job, you will isolate
all the SIMD-dependent code into a few DLLs and have the installer choose
the correct version of each for the current hardware; a sketch of that
structure appears at the end of this post.)

The second practical issue is much nastier.  Two implementations of the very
same ISA can have very different performance behavior.  Worse, two otherwise
exclusive features can have nasty interactions in a given implementation.
Let me take a simple example, MMX and 3DNow!  Athlons allow integers and
floating-point values to share architectural registers.  Given the large
floating-point register rename files, this is actually a nice feature.  But
if mode bits get reset, the programmer usually cares which mode bits are
used for which operations.  The solution is to generate an SFENCE
instruction, which ensures that the view of hardware registers and memory is
globally consistent, even for things that are otherwise weakly ordered.
This instruction can have almost no latency--or require thousands of clock
cycles in the worst cases.  (For example, a write may cause a TLB miss, and
the part of the page table that needs to be read may not be in the L1 or L2
cache.)

So what code should a compiler generate?  The usual solution is to consider
both the average execution time and the variance when choosing between two
code sequences.  Would you rather the compiler used sequence A, with a
minimum of 107 clocks and a maximum of 192, or sequence B, with a minimum of
100 clocks and a worst case of 1000?  This often results in not using MMX
registers or SSE code where the potential savings is only a few percent.  If
the user forces the compiler to use SSE in all cases, the horrible sequences
will be in there along with the good ones.

One last horrible problem goes by the innocent-sounding name of
store-to-load forwarding.  On modern processors, the actual store from a
register to either cache or main memory can take place hundreds of clock
cycles after the beginning of the move instruction.  Out-of-order processors
get around this by keeping track of pending writes in rename registers; if a
load instruction for that data is encountered, the load is turned into a
no-op and the register is renamed as the target of the load.  But what if
only part of the load data is coming from the store and the rest is being
read from cache or main memory?  Most chips throw up their hands and make
the load instruction dependent on the store instruction being retired.  This
is a nasty cost you don't want to run into.  (There are also other ways to
run into store-to-load forwarding problems, but that is another topic.)
What if you have a 32-bit integer in an integer register and want to combine
it into a 64-bit or 128-bit SSE operand?  Uh-oh!  Much better to avoid the
store-to-load restrictions and the SSE operations.  Again, this is something
where you expect (hope?) the compiler will get it right, and forcing the use
of SSE can result in very suboptimal code.
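
As for isolating the SIMD-dependent code, here is one (hypothetical,
GNAT-flavored) way the dispatch can be structured in Ada, so that a portable
version and an SSE version built as separate libraries or DLLs can be
swapped without touching the callers.  All the names (Vector_Math,
Float_Vec, Add_Proc, Add, Add_Plain, Add_SSE) are mine, and the CPU check is
deliberately left as a comment.

package Vector_Math is

   pragma Elaborate_Body;  --  the body binds Add during elaboration

   type Float_Vec is array (0 .. 3) of Float;

   type Add_Proc is access procedure (A, B : in Float_Vec; R : out Float_Vec);

   --  Single dispatch point: callers always go through Vector_Math.Add and
   --  never see which implementation they got.
   Add : Add_Proc;

end Vector_Math;

package body Vector_Math is

   --  Portable fallback that works on any hardware.
   procedure Add_Plain (A, B : in Float_Vec; R : out Float_Vec) is
   begin
      for I in Float_Vec'Range loop
         R (I) := A (I) + B (I);
      end loop;
   end Add_Plain;

begin
   --  In a real program this choice would test the CPU (via CPUID) or
   --  simply depend on which DLL the installer dropped in, and would pick
   --  something like Add_SSE'Access instead.
   Add := Add_Plain'Access;
end Vector_Math;

Client code just writes Vector_Math.Add (X, Y, Z); only the package body
(or the library behind it) knows whether SSE is being used.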