From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Thread: 103376,7767a311e01e1cd
X-Google-Attributes: gid103376,public
X-Google-Language: ENGLISH,ASCII-7-bit
Path: 
 g2news2.google.com!news1.google.com!news4.google.com!border1.nntp.dca.giganews.com!nntp.giganews.com!local01.nntp.dca.giganews.com!nntp.comcast.com!news.comcast.com.POSTED!not-for-mail
NNTP-Posting-Date: Fri, 20 Oct 2006 10:58:34 -0500
Date: Fri, 20 Oct 2006 11:56:50 -0400
From: Jeffrey Creem <jeff@thecreems.com>
User-Agent: Mozilla Thunderbird 1.0.7 (Windows/20050923)
X-Accept-Language: en-us, en
MIME-Version: 1.0
Newsgroups: comp.lang.ada
Subject: Re: GNAT compiler switches and optimization
References: <1161341264.471057.252750@h48g2000cwc.googlegroups.com>
In-Reply-To: <1161341264.471057.252750@h48g2000cwc.googlegroups.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Message-ID: <g68n04-7ft.ln1@newserver.thecreems.com>
NNTP-Posting-Host: 24.147.74.171
X-Trace: 
 sv3-JZUOwHvJu8xErApECswYaBrXKrfGl1sjWVIDA+fzkeAf7xIlzgAu6fg+Zrq/ST/WMZeHpV5Rm0DW80S!HEWKCGylZ42tgoKzPTRmj/OtDHHqNMen5iY6ppJLvUYjloNWB4yNwNWXNwGHcANUhsv2txU5Kmd5!0oM=
X-Complaints-To: abuse@comcast.net
X-DMCA-Complaints-To: dmca@comcast.net
X-Abuse-and-DMCA-Info: Please be sure to forward a copy of ALL headers
X-Abuse-and-DMCA-Info: Otherwise we will be unable to process your complaint
 properly
X-Postfilter: 1.3.32
Xref: g2news2.google.com comp.lang.ada:7084
Date: 2006-10-20T11:56:50-04:00
List-Id: <comp.lang.ada>

tkrauss wrote:
> I'm a bit stuck trying to figure out how to coax more performance
> out of some Ada code.  I suspect there is something simple (like
> compiler switches) but I'm missing it.  As an example I'm using
> a simple matrix multiply and comparing it to similar code in
> Fortran.  Unfortunately the Ada code takes 3-4 times as long.
> 

There have been a few useful comments (and quite a few not really useful 
ones) but in the end, it seems pretty clear to me that in this 
particular case GNAT sucks compared to the Fortran version.

I built the GCC "head" from GCC SVN with both GNAT and gfortran so that I 
was comparing the same compiler version (at least as much as possible).

I moved the start timing calls after the array allocation and filling so 
we are just timing the matrix multiplication.

I moved the end timing calls to make sure we were not timing IO in 
either case (both original versions were timing part of the "put").

I replaced the "random" data with some fixed sane data just to be sure 
there was no funky "denormal" stuff happening that changed the speed.
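
As a rough illustration of that measurement setup, here is a minimal C 
sketch. It is not the original Ada or Fortran test program, neither of 
which is shown in this post. The names matmul and time_matmul, the use of 
clock(), and the fill values are all assumptions made for the example. 
The arrays are filled with fixed, normal values before the clock starts, 
and the result is only read after the clock stops.

```c
#include <stdlib.h>
#include <time.h>

/* Plain triple-loop multiply of n x n row-major matrices: c = a * b. */
static void matmul(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Returns the CPU seconds spent inside matmul only, or -1.0 if an
 * allocation fails. The first result element is stored in *first so
 * the caller can print or check it after the timed region. */
double time_matmul(int n, float *first)
{
    size_t count = (size_t)n * (size_t)n;
    float *a = malloc(count * sizeof *a);
    float *b = malloc(count * sizeof *b);
    float *c = malloc(count * sizeof *c);
    if (a == NULL || b == NULL || c == NULL) {
        free(a);
        free(b);
        free(c);
        return -1.0;
    }

    /* Allocation and filling happen before the clock starts.
     * Fixed values (assumed here) avoid denormals from random data. */
    for (size_t i = 0; i < count; ++i) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    clock_t start = clock();
    matmul(a, b, c, n);
    clock_t stop = clock();

    /* Any output of the result belongs after the stop call. */
    *first = c[0];

    free(a);
    free(b);
    free(c);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_matmul with a matrix size and printing the returned seconds 
afterwards keeps both the setup and the IO out of the measurement.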

None of this changed much compared to what the original poster was 
seeing (I pretty much get results with GNAT running about 2.6 times 
slower), so it was time to look at the assembly.

I find it easier to read assembly that uses SSE math, so I built the 
GNAT version via

gnatmake -g -f -gnatp -O3 -march=pentium4 -fomit-frame-pointer -mfpmath=sse tst_array

and the Fortran version via

gfortran -O3 -g -march=pentium4 -fomit-frame-pointer -mfpmath=sse -c tst_array.f95

and then used

objdump -D -S tst_array.o

to look at them. You can pretty quickly see the problem.

The "inner loop" of the Fortran code looks like

  2d0:   8d 04 19                lea    (%ecx,%ebx,1),%eax
  2d3:   f3 0f 10 02             movss  (%edx),%xmm0
  2d7:   f3 0f 59 44 85 04       mulss  0x4(%ebp,%eax,4),%xmm0
  2dd:   f3 0f 58 c8             addss  %xmm0,%xmm1
  2e1:   83 c1 01                add    $0x1,%ecx
  2e4:   01 f2                   add    %esi,%edx
  2e6:   39 f9                   cmp    %edi,%ecx
  2e8:   75 e6                   jne    2d0 <MAIN__+0x2d0>


The "inner loop" of the Ada code looks like


  af2:   83 c6 01                add    $0x1,%esi
  af5:   89 f0                   mov    %esi,%eax
  af7:   2b 44 24 28             sub    0x28(%esp),%eax
  afb:   03 44 24 30             add    0x30(%esp),%eax
  aff:   8b 5c 24 38             mov    0x38(%esp),%ebx
  b03:   f3 0f 10 0c 83          movss  (%ebx,%eax,4),%xmm1
  b08:   8b 4d 00                mov    0x0(%ebp),%ecx
  b0b:   8b 45 0c                mov    0xc(%ebp),%eax
  b0e:   8b 55 08                mov    0x8(%ebp),%edx
  b11:   8b 5c 24 78             mov    0x78(%esp),%ebx
  b15:   29 d3                   sub    %edx,%ebx
  b17:   89 f7                   mov    %esi,%edi
  b19:   29 cf                   sub    %ecx,%edi
  b1b:   89 f9                   mov    %edi,%ecx
  b1d:   83 c0 01                add    $0x1,%eax
  b20:   29 d0                   sub    %edx,%eax
  b22:   01 c0                   add    %eax,%eax
  b24:   01 c0                   add    %eax,%eax
  b26:   ba 00 00 00 00          mov    $0x0,%edx
  b2b:   0f 48 c2                cmovs  %edx,%eax
  b2e:   0f af c8                imul   %eax,%ecx
  b31:   8d 1c 99                lea    (%ecx,%ebx,4),%ebx
  b34:   8b 44 24 3c             mov    0x3c(%esp),%eax
  b38:   f3 0f 10 04 03          movss  (%ebx,%eax,1),%xmm0
  b3d:   f3 0f 59 c1             mulss  %xmm1,%xmm0
  b41:   f3 0f 58 d0             addss  %xmm0,%xmm2
  b45:   3b 74 24 7c             cmp    0x7c(%esp),%esi
  b49:   75 a7                   jne    af2 <_ada_tst_array+0x254>

28 instructions vs. 8 for Fortran.
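
To make the difference between the two listings easier to see, here is 
a hedged C sketch of the two loop shapes as I read the disassembly. It 
is not the source of either program. The struct bounds2 descriptor, the 
function names and the parameter layout are hypothetical, and volatile 
is used only to force the bounds to be re-read on every pass, which is 
what the Ada listing appears to do with its loads from 0x0(%ebp), 
0x8(%ebp) and 0xc(%ebp).

```c
#include <stddef.h>

/* Shape of the Fortran loop: one operand is indexed from a base, the
 * other is walked with a stride that was computed before the loop. */
float dot_hoisted(const float *row, const float *col, int n,
                  ptrdiff_t stride)
{
    float sum = 0.0f;
    for (int k = 0; k < n; ++k) {
        sum += *col * row[k];
        col += stride;          /* the single "add %esi,%edx" */
    }
    return sum;
}

/* Hypothetical bounds descriptor for a two-dimensional array. */
struct bounds2 {
    int first1, last1;          /* bounds of the first index  */
    int first2, last2;          /* bounds of the second index */
};

/* Shape of the Ada loop: the bounds are re-read and the row length is
 * recomputed (and clamped at zero) on every iteration, followed by a
 * multiply to form the element offset. */
float dot_recomputed(const float *a_row, const float *b,
                     const volatile struct bounds2 *bd,
                     int j, int k_first, int k_last)
{
    float sum = 0.0f;
    for (int k = k_first; k <= k_last; ++k) {
        int first1 = bd->first1;            /* reloaded each pass */
        int first2 = bd->first2;
        int last2  = bd->last2;
        int row_len = last2 - first2 + 1;
        if (row_len < 0)
            row_len = 0;                    /* the cmovs clamp */
        sum += a_row[k - k_first]
             * b[(k - first1) * row_len + (j - first2)];
    }
    return sum;
}
```

Both functions compute the same dot product. The second one simply does 
per-iteration work that could have been moved out of the loop, since 
nothing inside the loop changes the bounds.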

The GNAT version never stood a chance. It really seems like GNAT is 
dropping the ball here.

Granted, small benchmarks can really lead one to believe things are 
better or worse than they truly are, but I don't think there is really 
an excuse in this case for this sort of performance.