From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM
	autolearn=ham autolearn_force=no version=3.4.4
X-Google-Thread: 103376,103803355c3db607
X-Google-NewGroupId: yes
X-Google-Attributes: gida07f3367d7,domainid0,public,usenet
X-Google-Language: ENGLISH,ASCII-7-bit
Received: by 10.68.223.40 with SMTP id qr8mr5484138pbc.0.1342297047988;
        Sat, 14 Jul 2012 13:17:27 -0700 (PDT)
Path: 
 l9ni11739pbj.0!nntp.google.com!news2.google.com!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail
From: Keean Schupke <keean.schupke@googlemail.com>
Newsgroups: comp.lang.ada
Subject: Re: GNAT (GCC) Profile Guided Compilation
Date: Sat, 14 Jul 2012 13:17:27 -0700 (PDT)
Organization: http://groups.google.com
Message-ID: <2dba1140-4f28-4fb8-ace4-2c10f3a02313@googlegroups.com>
References: <dac2857a-6f74-4ecb-a5d2-f6b73fbd0ecc@googlegroups.com>
 <dd9d3648-4538-4aa2-8a0e-557bed1799b3@googlegroups.com>
 <38b9c365-a2b2-4b8b-8d2a-1ea39d08ce86@googlegroups.com>
 <d15a813f-d697-4c80-ad7c-d110382b92d7@googlegroups.com>
 <982d531a-3972-4971-b802-c7e7778b8649@googlegroups.com>
 <520bdc39-6004-4142-a227-facf14ebb0e8@googlegroups.com>
 <4ff08cb2$0$6575$9b4e6d93@newsspool3.arcor-online.net>
 <a4f2a43e-5593-48f6-9e0f-7d0057874f94@googlegroups.com>
 <4ff1d731$0$6582$9b4e6d93@newsspool3.arcor-online.net>
 <cdbe38d2-c8b0-41b2-9830-d913aefa200c@googlegroups.com>
 <fed934c8-9cff-4905-811d-9f9d3050d0b1@googlegroups.com>
 <4ff41d38$0$6577$9b4e6d93@newsspool3.arcor-online.net>
 <26b778c4-5abc-4fbf-94b0-888c2ce71831@googlegroups.com>
 <4ff43956$0$6576$9b4e6d93@newsspool3.arcor-online.net>
NNTP-Posting-Host: 82.44.19.199
Mime-Version: 1.0
X-Trace: posting.google.com 1342297047 15840 127.0.0.1 (14 Jul 2012 20:17:27
 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Sat, 14 Jul 2012 20:17:27 +0000 (UTC)
In-Reply-To: <4ff43956$0$6576$9b4e6d93@newsspool3.arcor-online.net>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=82.44.19.199;
 posting-account=T5Z2vAoAAAB8ExE3yV3f56dVATtEMNcM
User-Agent: G2/1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Date: 2012-07-14T13:17:27-07:00
List-Id: <comp.lang.ada>

On Wednesday, 4 July 2012 13:38:45 UTC+1, Georg Bauhaus  wrote:
> On 04.07.12 12:57, Keean Schupke wrote:
> > On Wednesday, 4 July 2012 11:38:57 UTC+1, Georg Bauhaus  wrote:
> >> On 03.07.12 01:48, Keean Schupke wrote:
> >>> I have done some testing with the Linux "perf" tool. These are some figures for the Ada version:
> >>>
> >>>           1,014,900 l1-dcache-load-misses     #    0.01% of all L1-dcache hits
> >>>      12,462,973,199 l1-dcache-loads
> >>>           7,311,495 cache-references
> >>>              38,804 cache-misses              #    0.531 % of all cache refs
> >>>       2,588,686,069 branch-instructions
> >>>         388,460,030 branch-misses             #   15.01% of all branches
> >>>        21.885512117 seconds time elapsed
> >>>
> >>> And here are the results for the C++ version:
> >>>
> >>>             840,245 l1-dcache-load-misses     #    0.01% of all L1-dcache hits
> >>>      11,140,761,995 l1-dcache-loads
> >>>           6,019,321 cache-references
> >>>              27,584 cache-misses              #    0.458 % of all cache refs
> >>>       3,049,597,029 branch-instructions
> >>>         560,173,316 branch-misses             #   18.37% of all branches
> >>>        17.823476294 seconds time elapsed
> >>>
> >>>
> >>> So the interesting thing is that the Ada version has fewer overall branches and fewer branch misses than the C++ version, so it seems the profile-guided compilation has achieved as much. There is another factor limiting performance. The interesting figure would appear to be the cache-misses.
> >>>
> >>> So it would appear I need to focus on the cache utilisation of the Ada code.
> >>
> >> FWIW, looking at the 1D vs 2D subprograms in order to learn
> >> about a (dis)advantage of writing 2D arrays, I found some
> >> things potentially interesting.
> >>
> >> When there is no additional test in the loops,
> >> Apple's Instruments shows two orders of magnitude fewer
> >> branch instructions executed by the 2D subprogram
> >> compared to the 1D subprogram, 5M : 2G. This seems huge to me,
> >> but is reproducible. A naive look at the assembly listing offers
> >> some confirmation, mentioned below, though not on the same order.
> >>
> >> With the "mod" based test added to the respective loops the number
> >> of branch instructions executed by the 2D subprogram increases
> >> to about one half of that of the 1D subprogram's. Still better.
> >>
> >> The assembly listing of the subprograms without tests added has
> >>
> >> - [compute_1d] 3 pairs of forward je and 1 backward jne near
> >>    the end
> >>
> >> - [compute_2] 1 pair of backward jne near the end,
> >>
> >> It appears that unrolling yields two somewhat differently
> >> structured lists of instructions, but I'm drifting away
> >> from Ada.
> >>
> >> Compiling with profile data rearranges the jumps for 1D, adds jumps to 2D,
> >> and shortens both procedures. However, this slows both down using the latest
> >> GNAT GPL on Core i7; there is some speed-up of the 1D procedure with
> >> Debian's GNAT 4.4.5 on Xeon E5645, though. (-O2 -funroll-loops -gnatp)
> >>
> >> All of this breaks once I turn on -O3.
> >> Not sure whether this is a lottery or a mine field. ;-)
> >>
> >> Cheers,
> >> Georg
> >
> >
> > How can I turn off inlining for a function in GNAT?
>
> Sometimes by reordering code, making sure the body hasn't
> been seen when the compiler sees the call statement.
> Or try separate compilation.  The following arrangement
> appears to prevent inline expansion of Inc, even when
> just the main unit is fed to gnatmake -O3 -gnatNp, so that
> GNAT translates everything else automatically, using the
> same switches.
>
> -fno-inline is another switch to consider. However, it
> appears to be interfering with other optimizations (loop
> unrolling, vectorizer, from what I can guess).
>
> package Prevent_Inline is
>    type List is array (Positive range <>) of Integer;
>    procedure Inc (X : in out Integer);
>    procedure Inc_All (A : in out List);
> end Prevent_Inline;
>
> with Prevent_Inline.Aux;
> package body Prevent_Inline is
>
>    procedure Inc (X : in out Integer) is
>    begin
>       X := X + 1;
>    end Inc;
>
>    procedure Inc_All (A : in out List)
>      renames Prevent_Inline.Aux;
>
> end Prevent_Inline;
>
> procedure Prevent_Inline.Aux (A : in out List) is
> begin
>    for X of A loop
>       Inc (X);
>    end loop;
> end Prevent_Inline.Aux;
>
> with Prevent_Inline;    use Prevent_Inline;
> procedure Test_Prevent_Inline is
>    X : List (1 .. 10);
> begin
>    Inc_All (X);
> end Test_Prevent_Inline;

Okay, I think I have tracked down the performance problem, but I am not sure how to fix it. It would appear that the C++ code that returns a boolean from a function generates a decision tree using tests and branches, whereas the Ada code sets the result into a Boolean variable. The result is that C++ bails out of the evaluation as soon as it can (i.e. if one side of an 'and' is false, or one side of an 'or' is true), but Ada always evaluates all parts of the expression.

Is this a difference in language semantics, and what is the best way to deal with it? Do I need to rewrite all 'and' and 'or' operators in conditionals as nested if statements to get the evaluate-only-as-far-as-necessary semantics of C/C++?
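
For reference, the difference being asked about can be seen in a small stand-alone sketch. Ada's plain 'and' and 'or' evaluate both operands, while the short-circuit forms 'and then' and 'or else' stop as soon as the result is known, much like && and || in C/C++. The names Short_Circuit_Demo, Check, Report and Calls below are made up for illustration and are not taken from the program under discussion; the operands are given a side effect (a counter) purely so that the number of evaluations is visible.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Short_Circuit_Demo is

   --  Number of operands evaluated since the last report.
   Calls : Natural := 0;

   --  Hypothetical operand: returns its argument and records that it ran.
   function Check (Value : Boolean) return Boolean is
   begin
      Calls := Calls + 1;
      return Value;
   end Check;

   --  Prints the result and the evaluation count, then resets the counter.
   --  Result is fully evaluated before Report is entered.
   procedure Report (Label : String; Result : Boolean) is
   begin
      Put_Line (Label & " => " & Boolean'Image (Result)
                & ", operands evaluated:" & Natural'Image (Calls));
      Calls := 0;
   end Report;

begin
   --  Plain logical operators: both operands are evaluated.
   Report ("False and      True ", Check (False) and Check (True));
   Report ("True  or       False", Check (True) or Check (False));

   --  Short-circuit control forms: the right operand is skipped
   --  when the left operand already decides the result.
   Report ("False and then True ", Check (False) and then Check (True));
   Report ("True  or else  False", Check (True) or else Check (False));
end Short_Circuit_Demo;
```

If this matches the situation in the real code, replacing 'and' with 'and then' and 'or' with 'or else' in the hot conditionals may be enough, without restructuring them into nested if statements. Whether that recovers the timing difference would need to be measured.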


Cheers,
Keean.