Newsgroups: comp.lang.ada
Subject: Re: Toy computational "benchmark" in Ada (new blog post)
From: David Trudgett
Date: Sat, 8 Jun 2019 03:17:03 -0700 (PDT)

On Friday, 7 June 2019 at 15:34:22 UTC+10, David Trudgett wrote:
> On Friday, 7 June 2019 at 11:42:07 UTC+10, john...@googlemail.com wrote:
> >
> > On my machine I get a nice improvement over -O3 when I
> > take the arrays off the heap, and then use the following 2 flags:
> >
> > -march=native -funroll-loops
>
> That's interesting. Thank you. I'll try that (and your mods below) over the weekend and see what the result is for me.
>
> I thought the -O3 would unroll loops where appropriate. Is that not the case?
>
> I assume that native arch means it will generate optimal instructions for the particular architecture on which the compile is running?
>
> >
> > Modifying the programs is easy:
> >
> > --Values_Array : Values_Array_Access := new Values_Array_Type;
> > Values_Array : Values_Array_Type;
> >
> > In the parallel version, change the loop in the task body:
> >
> > -- declare
> > --    Val : Float64 renames Values_Array (Idx);
> > -- begin
> >    My_Sum := My_Sum + Values_Array (Idx) ** 2;
> > -- end;
> >
> > The -funroll-loops gave me a nice improvement on the parallel
> > program, less so on the serial version. (Makes no sense to me
> > at all!) If you are running in a Unix shell, you usually need
> > to tell the system if you're going to put giant arrays on the
> > stack. I type this on the command line: ulimit -s unlimited.
>
> Ah yes. I used the heap because I didn't want to use such a huge stack (and got the expected error message when I tried anyway). But I wonder why the heap should be any slower? I can't see any reason why it would be.
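For anyone following along, the stack-allocation change boils down to declaring the array object directly instead of reaching it through an access type. Below is a minimal, self-contained sketch in that spirit; it borrows the names from the snippets above (Float64, Values_Array_Type, Values_Array, My_Sum), but the array size, the fill loop, the Float64 definition and the program name are placeholders rather than the actual benchmark:

with Ada.Text_IO;

procedure Sum_Squares_Stack is
   --  Placeholder size: at 8 bytes per element this is roughly 80 MB of
   --  stack, so "ulimit -s unlimited" (or similar) is needed beforehand.
   Num_Values : constant := 10_000_000;

   --  Stand-in definition; the real program's Float64 may be declared
   --  differently (e.g. as Long_Float or Interfaces.IEEE_Float_64).
   type Float64 is digits 15;
   type Values_Array_Type is array (1 .. Num_Values) of Float64;

   --  Declared directly, so the array lives on the stack, replacing
   --  the earlier "new Values_Array_Type" heap allocation.
   Values_Array : Values_Array_Type;

   My_Sum : Float64 := 0.0;
begin
   --  Fill with some arbitrary data, then sum the squares.
   for Idx in Values_Array'Range loop
      Values_Array (Idx) := Float64 (Idx);
   end loop;

   for Idx in Values_Array'Range loop
      My_Sum := My_Sum + Values_Array (Idx) ** 2;
   end loop;

   Ada.Text_IO.Put_Line ("Sum of squares:" & Float64'Image (My_Sum));
end Sum_Squares_Stack;

Built with something like "gnatmake sum_squares_stack.adb -cargs -O3 -march=native -funroll-loops", it should exercise the same stack-allocation and flag changes discussed above, though the real programs obviously do more than this.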
Okay, I have tried (a) using stack allocation instead of the heap, and (b) compiling with -march=native, and I have compared the resulting timings with the original. The figures below come from running each program three times in a row and averaging the reported run times, so each average effectively covers 150 calculation runs (50 calc runs per program run).

Original program: 434.718 ms
Stack allocation: 435.667 ms
Native arch flag: 435.745 ms

As you can see, there is virtually no difference. I did verify that the native-architecture compilation did, in fact, use AVX instructions rather than SSE (but not AVX2).

It's interesting that the AVX instructions gave no improvement in run time (technically they added 1 ms, but I expect that to be statistically insignificant).

Cheers,
David