Newsgroups: comp.lang.ada
Date: Sat, 8 Jun 2019 03:56:26 -0700 (PDT)
From: David Trudgett
Subject: Re: Toy computational "benchmark" in Ada (new blog post)
Message-ID: <2cbe1ae9-b12b-4e97-80df-1395784a2c01@googlegroups.com>
In-Reply-To: <10240625-5cff-4d5a-a144-f21a3b8b1a08@googlegroups.com>

On Saturday, 8 June 2019 at 11:14:06 UTC+10, john wrote:

> > I thought the -O3 would unroll loops where appropriate. Is that not the case?
>
> Not on gcc.
Unrolling doesn't seem to help much though. Yes, there is nothing to unroll, except a massive one-billion-iteration loop, which (a) would not be reasonable to unroll, and (b) would most likely cause a performance hit if it were, due to cache effects.

> > I assume that native arch means it will generate optimal instructions for the particular architecture on which the compile is running?
>
> Sometimes it makes things worse! Though that's rare. Sometimes it helps a little. That's my experience, which is pretty limited.

In this case, I see that it generated AVX instructions instead of SSE, but there was no speed gain as a result.

> > Ah yes. I used the heap because I didn't want to use such a huge stack (and got the expected error message when I tried anyway). But I wonder why the heap should be any slower? I can't see any reason why it would be.
>
> CPUs and compilers are so complex now that I never know
> for sure what's going on. The interesting thing here is
> that the array is almost entirely in RAM, which makes floating
> point desperately slow.

With linear RAM access, I would expect the cache to be pre-populated/fetched by the predictive caching mechanisms of the CPU. The fact that the CPU is pegged at 100% during the (single-threaded) calculation would seem to support this idea and indicate that RAM is supplying data at a fast enough rate. In the multithreaded version, each CPU was about 1% idle, presumably due to some SMP contention issues (maybe bus bandwidth limitations or something like that). (It is my understanding as a non-hardware specialist that it is usually RAM latency that is the real performance killer, and not the theoretical raw throughput potential, which is rarely achieved.) It does seem to me that processing 8 GiB worth of floating point values, doing a multiply and add for each one in under half a second using SSE2 instructions, is pretty good, really.
>
> If you compile the 2 programs below with the -S switch,
> and read the .s file, then you find that gcc produces SSE code
> for both the C and Ada programs. In other words you
> see instructions like:
> vmulsd %xmm0, %xmm0, %xmm0
> vaddsd %xmm0, %xmm1, %xmm1

Those are AVX instructions, I think. (SSE would be MULSD and ADDSD, as I understand it.)

> That won't help much if fetching memory from RAM is too slow
> to keep the multipliers busy.
>
> If you compile with the -mfpmath=387 switch, then no SSE code
> is generated, and the running time is about the same. (On my
> machine.)
>
> When you compare programs in different languages, you need to
> write them the same. See below! I get identical run times from
> the two with all the compiler switches I try, as long as they
> are the same compiler switches. You can try various combinations
> of -O2, -O3, -mfpmath=387 etc:

Yeah. We are not really comparing languages here, though, but the instructions that are generated by the compiler. I'm sure that if we got GNAT to generate AVX2 or AVX512 instructions, then the performance would be the same as the AVX2 code generated by the MS C compiler.

We have to bear in mind, though, that there are limited reasons to want that level of custom binary, because more often one would want a program to run on a variety of processors within the same family. Of course, a clever compiler could in theory compile several variants and choose between them at run time.

> gnatmake -O3 -march=native -funroll-loops map.adb
> gcc -O3 -march=native -funroll-loops map.c
>
> and remember to make room for the arrays on the stack. On the
> bash shell, it's ulimit -s unlimited. On Linux, timing
> with 'time ./a.out' and 'time ./map' works OK, but run them
> repeatedly, and remove any background processes (like browsers!).
>
> #include <stdio.h>
> int main()
> {
>   int Calculation_Runs = 100;
>   int Data_Points = 320000000;
>   int i, j;
>   double s = 0.0;
>   double v[Data_Points];
>
>   for (i=0; i<Data_Points; i++) {
>     v[i] = 3.14159265358979323846;
>   }
>
>   for (j=0; j<Calculation_Runs; j++) {
>     for (i=0; i<Data_Points; i++) {
>       s = s + v[i] * v[i];
>     }
>   }
>   printf("Sum = %f\n",s);
> }
>
> with Ada.Text_IO; use Ada.Text_IO;
> procedure Map is
>    Calculation_Runs : constant := 100;
>    Data_Points      : constant := 320_000_000;
>
>    type Values_Index is range 1 .. Data_Points;
>    type Float64 is digits 15;
>    type Values_Array_Type is array (Values_Index) of Float64;
>    Values_Array : Values_Array_Type;
>    Sum : Float64 := 0.0;
> begin
>    for i in Values_Index loop
>       Values_Array (i) := 3.14159265358979323846;
>    end loop;
>
>    for j in 1 .. Calculation_Runs loop
>       for i in Values_Index loop
>          Sum := Sum + Values_Array (i) * Values_Array (i);
>       end loop;
>    end loop;
>    Put_Line ("Sum = " & Sum'Image);
> end Map;

Note that there is no timing in either of those versions, so if you are using a shell timer (as in bash: "$ time ./myprog"), you are probably not getting good resolution and, more importantly, you are timing other things besides the "map reduce" calculation, which were specifically meant to be excluded.

Cheers,
David