From: Jeffrey Creem
Newsgroups: comp.lang.ada
Subject: Re: GNAT compiler switches and optimization
Date: Fri, 20 Oct 2006 11:56:50 -0400
References: <1161341264.471057.252750@h48g2000cwc.googlegroups.com>

tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code. I suspect there is something simple (like
> compiler switches) but I'm missing it. As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran. Unfortunately the Ada code takes 3-4 times as long.
>

There have been a few useful comments (and quite a few not really useful ones) but in the end, it seems pretty clear to me that in this particular case GNAT sucks compared to the Fortran version.

I built the gcc "head" from gcc SVN with GNAT and Fortran to compare the same versions (at least as much as possible).

I moved the start timing calls after the array allocation and filling so we are just timing the matrix multiplication.

I moved the end timing calls to make sure we were not timing IO in either case (both original versions were timing part of the "put").

I replaced the "random" data with some fixed sane data just to be sure there was no funky "denormal" stuff happening that changed the speed.

None of this changed the order of magnitude of what the original poster was seeing (I pretty much get results with GNAT running about 2.6 times slower), so it was time to look at the assembly.

I find it easier to read assembly using SSE math, so I built the GNAT version via

    gnatmake -g -f -gnatp -O3 -march=pentium4 -fomit-frame-pointer -mfpmath=sse tst_array

and the Fortran version via

    gfortran -O3 -g -march=pentium4 -fomit-frame-pointer -mfpmath=sse -c tst_array.f95

and then used

    objdump -D -S tst_array.o

to look at them. You can pretty quickly see the problem.
The "inner loop" of the Fortran code looks like

  2d0:   8d 04 19                lea    (%ecx,%ebx,1),%eax
  2d3:   f3 0f 10 02             movss  (%edx),%xmm0
  2d7:   f3 0f 59 44 85 04       mulss  0x4(%ebp,%eax,4),%xmm0
  2dd:   f3 0f 58 c8             addss  %xmm0,%xmm1
  2e1:   83 c1 01                add    $0x1,%ecx
  2e4:   01 f2                   add    %esi,%edx
  2e6:   39 f9                   cmp    %edi,%ecx
  2e8:   75 e6                   jne    2d0

The "inner loop" of the Ada code looks like

  af2:   83 c6 01                add    $0x1,%esi
  af5:   89 f0                   mov    %esi,%eax
  af7:   2b 44 24 28             sub    0x28(%esp),%eax
  afb:   03 44 24 30             add    0x30(%esp),%eax
  aff:   8b 5c 24 38             mov    0x38(%esp),%ebx
  b03:   f3 0f 10 0c 83          movss  (%ebx,%eax,4),%xmm1
  b08:   8b 4d 00                mov    0x0(%ebp),%ecx
  b0b:   8b 45 0c                mov    0xc(%ebp),%eax
  b0e:   8b 55 08                mov    0x8(%ebp),%edx
  b11:   8b 5c 24 78             mov    0x78(%esp),%ebx
  b15:   29 d3                   sub    %edx,%ebx
  b17:   89 f7                   mov    %esi,%edi
  b19:   29 cf                   sub    %ecx,%edi
  b1b:   89 f9                   mov    %edi,%ecx
  b1d:   83 c0 01                add    $0x1,%eax
  b20:   29 d0                   sub    %edx,%eax
  b22:   01 c0                   add    %eax,%eax
  b24:   01 c0                   add    %eax,%eax
  b26:   ba 00 00 00 00          mov    $0x0,%edx
  b2b:   0f 48 c2                cmovs  %edx,%eax
  b2e:   0f af c8                imul   %eax,%ecx
  b31:   8d 1c 99                lea    (%ecx,%ebx,4),%ebx
  b34:   8b 44 24 3c             mov    0x3c(%esp),%eax
  b38:   f3 0f 10 04 03          movss  (%ebx,%eax,1),%xmm0
  b3d:   f3 0f 59 c1             mulss  %xmm1,%xmm0
  b41:   f3 0f 58 d0             addss  %xmm0,%xmm2
  b45:   3b 74 24 7c             cmp    0x7c(%esp),%esi
  b49:   75 a7                   jne    af2 <_ada_tst_array+0x254>

28 instructions vs. 8 for Fortran. The GNAT version never stood a chance.

It really seems like GNAT is dropping the ball here.

Granted, small benchmarks can really lead one to believe things are better or worse than the truth, but I don't think there is really an excuse in this case for this sort of performance.
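
For readers who do not want to trace the listings by hand, the two addressing patterns can be sketched in C. This is only one reading of the disassembly, not the source either compiler was given and not a statement about GNAT internals. The names bounds_t, dot_recompute and dot_stride are hypothetical. The first function reloads the array bounds and rebuilds the element offset on every iteration, roughly what the Ada loop appears to do (including the clamp to zero that the cmovs suggests). The second steps a pointer by a fixed stride, roughly what the Fortran loop appears to do.

```c
#include <stddef.h>

/* Hypothetical bounds record, loosely modelled on the values the Ada
   loop reloads from 0x0(%ebp), 0x8(%ebp) and 0xc(%ebp). */
typedef struct {
    int row_first;  /* lower bound of the first dimension  */
    int col_first;  /* lower bound of the second dimension */
    int col_last;   /* upper bound of the second dimension */
} bounds_t;

/* Pattern A: rebuild the element offset from the bounds every iteration.
   a_row points at the first element of the row being multiplied. */
float dot_recompute(const float *a_row, const float *b, const bounds_t *bd,
                    int j, int k_first, int k_last)
{
    float sum = 0.0f;
    for (int k = k_first; k <= k_last; ++k) {
        int row_len = bd->col_last - bd->col_first + 1;
        if (row_len < 0) {
            row_len = 0;  /* empty range: length clamps to zero */
        }
        ptrdiff_t off = (ptrdiff_t)(k - bd->row_first) * row_len
                        + (j - bd->col_first);
        sum += a_row[k - k_first] * b[off];
    }
    return sum;
}

/* Pattern B: the offset arithmetic is done once by the caller; the loop
   only bumps a pointer by a fixed stride. */
float dot_stride(const float *a_row, const float *b_col, size_t stride, int n)
{
    float sum = 0.0f;
    for (int k = 0; k < n; ++k) {
        sum += a_row[k] * *b_col;
        b_col += stride;
    }
    return sum;
}
```

Both functions compute the same dot product. An optimizing C compiler may well hoist the bounds arithmetic out of dot_recompute on its own, so the sketch shows the shape of the work in each loop rather than reproducing the measured slowdown.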