From: Georg Bauhaus
Newsgroups: comp.lang.ada
Subject: Re: GNAT (GCC) Profile Guided Compilation
Date: Wed, 04 Jul 2012 12:38:57 +0200
Organization: Arcor
Message-ID: <4ff41d38$0$6577$9b4e6d93@newsspool3.arcor-online.net>
References: <38b9c365-a2b2-4b8b-8d2a-1ea39d08ce86@googlegroups.com> <982d531a-3972-4971-b802-c7e7778b8649@googlegroups.com> <520bdc39-6004-4142-a227-facf14ebb0e8@googlegroups.com> <4ff08cb2$0$6575$9b4e6d93@newsspool3.arcor-online.net> <4ff1d731$0$6582$9b4e6d93@newsspool3.arcor-online.net>

On 03.07.12 01:48, Keean Schupke wrote:
> I have done some testing with the linux "perf" tool.
> These are some figures for the Ada version:
>
>      1,014,900 l1-dcache-load-misses     #    0.01% of all L1-dcache hits
> 12,462,973,199 l1-dcache-loads
>      7,311,495 cache-references
>         38,804 cache-misses              #   0.531 % of all cache refs
>  2,588,686,069 branch-instructions
>    388,460,030 branch-misses             #   15.01% of all branches
>
>   21.885512117 seconds time elapsed
>
> And here are the results for the C++ version:
>
>        840,245 l1-dcache-load-misses     #    0.01% of all L1-dcache hits
> 11,140,761,995 l1-dcache-loads
>      6,019,321 cache-references
>         27,584 cache-misses              #   0.458 % of all cache refs
>  3,049,597,029 branch-instructions
>    560,173,316 branch-misses             #   18.37% of all branches
>
>   17.823476294 seconds time elapsed
>
>
> So the interesting thing is that the Ada version has less overall branches and less branch misses than the C++ version, so it seems the profile-guided compilation has achieved as much. There is another factor limiting performance. The interesting figure would appear to be the cache-misses.
>
> So it would appear I need to focus on the cache utilisation of the Ada code.

FWIW, looking at the 1D vs 2D subprograms in order to learn about a (dis)advantage of writing 2D arrays, I found some things potentially interesting.

When there is no additional test in the loops, Apple's Instruments shows two orders of magnitude fewer branch instructions executed by the 2D subprogram compared to the 1D subprogram, 5M : 2G. This seems huge to me, but is reproducible. A naive look at the assembly listing offers some confirmation, mentioned below, though not on the same order.

With the "mod" based test added to the respective loops, the number of branch instructions executed by the 2D subprogram increases to about one half of the 1D subprogram's. Still better.

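The 1D and 2D subprograms themselves are Ada and are not reproduced in this message. Purely as an illustration of the two loop shapes being compared, here is a minimal C++ sketch. Everything in it is an assumption made for the example: the names sum_flat, sum_grid and fill, the sizes Rows and Cols, and the "% 3" condition standing in for the "mod" based test. It is not the code that was measured.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dimensions; the sizes used in the thread are not restated here.
constexpr std::size_t Rows = 4;
constexpr std::size_t Cols = 5;

using Grid = std::array<std::array<int, Cols>, Rows>;

// 1D variant: one flat array, row-major offset computed by hand.
// With Test = true, a "mod"-based test guards the accumulation.
template <bool Test>
long sum_flat(const std::vector<int>& flat) {
    assert(flat.size() == Rows * Cols);
    long sum = 0;
    for (std::size_t i = 0; i < Rows; ++i) {
        for (std::size_t j = 0; j < Cols; ++j) {
            const int v = flat[i * Cols + j];
            if (Test && v % 3 != 0) {
                continue;  // element rejected by the test
            }
            sum += v;
        }
    }
    return sum;
}

// 2D variant: the same traversal, with indexing left to the compiler.
template <bool Test>
long sum_grid(const Grid& grid) {
    long sum = 0;
    for (std::size_t i = 0; i < Rows; ++i) {
        for (std::size_t j = 0; j < Cols; ++j) {
            const int v = grid[i][j];
            if (Test && v % 3 != 0) {
                continue;  // element rejected by the test
            }
            sum += v;
        }
    }
    return sum;
}

// Fill both layouts with the same values so the results can be compared.
void fill(std::vector<int>& flat, Grid& grid) {
    flat.assign(Rows * Cols, 0);
    for (std::size_t i = 0; i < Rows; ++i) {
        for (std::size_t j = 0; j < Cols; ++j) {
            const int v = static_cast<int>(i * Cols + j);
            flat[i * Cols + j] = v;
            grid[i][j] = v;
        }
    }
}
```

Both variants compute the same sums, so any difference in the number of branch instructions comes from how the compiler lays out and unrolls the loops. Compiling each with the same options and comparing the assembly output (GCC's -S) is one way to see that.
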
The assembly listing of the subprograms without tests added has

- [compute_1d] 3 pairs of forward je and 1 backward jne near the end
- [compute_2] 1 pair of backward jne near the end.

It appears that unrolling yields two somewhat differently structured lists of instructions, but I'm drifting away from Ada.

Compiling with profile data rearranges the jumps for 1D, adds jumps to 2D, and shortens both procedures. However, this slows both down using the latest GNAT GPL on Core i7; there is some speed-up of the 1D procedure with Debian's GNAT 4.4.5 on Xeon E5645, though. (-O2 -funroll-loops -gnatp)

All of this breaks once I turn on -O3. Not sure whether this is a lottery or a minefield. ;-)

Cheers,
Georg