From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=0.4 required=5.0 tests=BAYES_00,FORGED_MUA_MOZILLA
	autolearn=no autolearn_force=no version=3.4.4
X-Google-Thread: 103376,103803355c3db607
X-Google-NewGroupId: yes
X-Google-Attributes: gida07f3367d7,domainid0,public,usenet
X-Google-Language: ENGLISH,ASCII-7-bit
Received: by 10.68.190.104 with SMTP id gp8mr6183469pbc.4.1341398354571;
        Wed, 04 Jul 2012 03:39:14 -0700 (PDT)
Path: 
 l9ni10838pbj.0!nntp.google.com!news1.google.com!news4.google.com!feeder1.cambriumusenet.nl!feed.tweaknews.nl!217.73.144.44.MISMATCH!ecngs!feeder.ecngs.de!news.osn.de!diablo2.news.osn.de!proxad.net!feeder2-2.proxad.net!newsfeed.arcor.de!newsspool4.arcor-online.net!news.arcor.de.POSTED!not-for-mail
Date: Wed, 04 Jul 2012 12:38:57 +0200
From: Georg Bauhaus <rm.dash-bauhaus@futureapps.de>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7;
 rv:13.0) Gecko/20120614 Thunderbird/13.0.1
MIME-Version: 1.0
Newsgroups: comp.lang.ada
Subject: Re: GNAT (GCC) Profile Guided Compilation
References: <dac2857a-6f74-4ecb-a5d2-f6b73fbd0ecc@googlegroups.com>
 <dd9d3648-4538-4aa2-8a0e-557bed1799b3@googlegroups.com>
 <38b9c365-a2b2-4b8b-8d2a-1ea39d08ce86@googlegroups.com>
 <d15a813f-d697-4c80-ad7c-d110382b92d7@googlegroups.com>
 <982d531a-3972-4971-b802-c7e7778b8649@googlegroups.com>
 <520bdc39-6004-4142-a227-facf14ebb0e8@googlegroups.com>
 <4ff08cb2$0$6575$9b4e6d93@newsspool3.arcor-online.net>
 <a4f2a43e-5593-48f6-9e0f-7d0057874f94@googlegroups.com>
 <4ff1d731$0$6582$9b4e6d93@newsspool3.arcor-online.net>
 <cdbe38d2-c8b0-41b2-9830-d913aefa200c@googlegroups.com>
 <fed934c8-9cff-4905-811d-9f9d3050d0b1@googlegroups.com>
In-Reply-To: <fed934c8-9cff-4905-811d-9f9d3050d0b1@googlegroups.com>
Message-ID: <4ff41d38$0$6577$9b4e6d93@newsspool3.arcor-online.net>
Organization: Arcor
NNTP-Posting-Date: 04 Jul 2012 12:38:49 CEST
NNTP-Posting-Host: 1ded801d.newsspool3.arcor-online.net
X-Trace: 
 DXC=mDeTmbc?<OM[6=1B@oB@@@McF=Q^Z^V3H4Fo<]lROoRA8kF<OcfhCOK80[U>_ljHiLPCY\c7>ejVH>Ra@5[>jGjI:VK6d;0Bd^L
X-Complaints-To: usenet-abuse@arcor.de
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Date: 2012-07-04T12:38:49+02:00
List-Id: <comp.lang.ada>

On 03.07.12 01:48, Keean Schupke wrote:
> I have done some testing with the linux "perf" tool. These are some figures for the Ada version:
>
>           1,014,900 l1-dcache-load-misses     #    0.01% of all L1-dcache hits
>      12,462,973,199 l1-dcache-loads
>           7,311,495 cache-references
>              38,804 cache-misses              #    0.531 % of all cache refs
>       2,588,686,069 branch-instructions
>         388,460,030 branch-misses             #   15.01% of all branches
>        21.885512117 seconds time elapsed
>
> And here are the results for the C++ version:
>
>             840,245 l1-dcache-load-misses     #    0.01% of all L1-dcache hits
>      11,140,761,995 l1-dcache-loads
>           6,019,321 cache-references
>              27,584 cache-misses              #    0.458 % of all cache refs
>       3,049,597,029 branch-instructions
>         560,173,316 branch-misses             #   18.37% of all branches
>        17.823476294 seconds time elapsed
>
>
> So the interesting thing is that the Ada version has less overall branches and less branch misses than the C++ version, so it seems the profile-guided compilation has achieved as much. There is another factor limiting performance. The interesting figure would appear to be the cache-misses.
>
> So it would appear I need to focus on the cache utilisation of the Ada code.
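
As a cross-check on the quoted figures, the miss percentages can be recomputed from the raw counters. This is only a small Python sketch: the dictionary layout and the function name `miss_rates` are made up for illustration, and the numbers are copied from the perf output quoted above.

```python
# Raw counters copied from the quoted perf output (Ada and C++ runs).
counters = {
    "Ada": {
        "cache_refs": 7_311_495,
        "cache_misses": 38_804,
        "branches": 2_588_686_069,
        "branch_misses": 388_460_030,
    },
    "C++": {
        "cache_refs": 6_019_321,
        "cache_misses": 27_584,
        "branches": 3_049_597_029,
        "branch_misses": 560_173_316,
    },
}


def miss_rates(c):
    """Return cache and branch miss rates in percent for one run."""
    return {
        "cache_miss_pct": 100.0 * c["cache_misses"] / c["cache_refs"],
        "branch_miss_pct": 100.0 * c["branch_misses"] / c["branches"],
    }


for name, c in counters.items():
    r = miss_rates(c)
    print(f"{name}: cache misses {r['cache_miss_pct']:.3f}%, "
          f"branch misses {r['branch_miss_pct']:.2f}%")
```

The rates are relative to cache references and branch instructions respectively, which is how perf annotates those two lines.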

FWIW, looking at the 1D vs 2D subprograms in order to learn
about a (dis)advantage of writing 2D arrays, I found a few
potentially interesting things.

When there is no additional test in the loops,
Apple's Instruments shows two orders of magnitude fewer
branch instructions executed by the 2D subprogram
compared to the 1D subprogram, 5M : 2G. This seems huge to me,
but is reproducible. A naive look at the assembly listing offers
some confirmation, mentioned below, though not of the same magnitude.

With the "mod"-based test added to the respective loops, the number
of branch instructions executed by the 2D subprogram increases
to about one half of the 1D subprogram's. Still better.

The assembly listing of the subprograms without tests added has

- [compute_1d] 3 pairs of forward je and 1 backward jne near
   the end

- [compute_2d] 1 pair of backward jne near the end.

It appears that unrolling yields two somewhat differently
structured lists of instructions, but I'm drifting away
from Ada.

Compiling with profile data rearranges the jumps for 1D, adds jumps to 2D,
and shortens both procedures. However, this slows both down using the latest
GNAT GPL on Core i7; there is some speed-up of the 1D procedure with
Debian's GNAT 4.4.5 on Xeon E5645, though. (-O2 -funroll-loops -gnatp)

All of this breaks once I turn on -O3.
Not sure whether this is a lottery or a minefield. ;-)

Cheers,
Georg