comp.lang.ada
From: David Trudgett <dktrudgett@gmail.com>
Subject: Re: Toy computational "benchmark" in Ada (new blog post)
Date: Sat, 8 Jun 2019 03:56:26 -0700 (PDT)
Message-ID: <2cbe1ae9-b12b-4e97-80df-1395784a2c01@googlegroups.com>
In-Reply-To: <10240625-5cff-4d5a-a144-f21a3b8b1a08@googlegroups.com>

On Saturday, June 8, 2019 at 11:14:06 AM UTC+10, john wrote:
> >I thought the -O3 would unroll loops where appropriate. Is that not the case?
> 
> Not on gcc. Unrolling doesn't seem to help much though.

Yes, there is nothing to unroll, except a massive one-billion-iteration loop, which (a) it would not be reasonable to unroll, and (b) unrolling would most likely cause a performance hit anyway, due to cache effects.

> 
> >I assume that native arch means it will generate optimal instructions for the particular architecture on which the compile is running?
> 
> Sometimes it makes things worse! Though that's rare. Sometimes it helps a little. That's my experience, which is pretty limited.

In this case, I see that it generated AVX instructions instead of SSE, but there was no speed gain as a result.



> 
> >Ah yes. I used the heap because I didn't want to use such a huge stack (and got the expected error message when I tried anyway). But I wonder why the heap should be any slower? I can't see any reason why it would be.
> 
> CPUs and compilers are so complex now that I never know
> for sure what's going on. The interesting thing here is
> that the array is almost entirely in RAM, which makes floating
> point desperately slow.

With linear RAM access, I would expect the cache to be populated ahead of time by the CPU's predictive prefetching mechanisms. The fact that the CPU is pegged at 100% during the (single-threaded) calculation would seem to support this idea and indicate that RAM is supplying data at a fast enough rate. In the multithreaded version, each CPU was about 1% idle, presumably due to some SMP contention issue (maybe bus bandwidth limitations or something like that).

(It is my understanding, as a non-hardware specialist, that it is usually RAM latency that is the real performance killer, not the theoretical raw throughput, which is rarely achieved in practice anyway.)

It does seem to me that processing 8 GiB worth of floating point values, doing a multiply and an add for each one, in under half a second using SSE2 instructions is pretty good, really.

> 
> If you compile the 2 programs below with the -S switch,
> and read the .s file, then you find that gcc produces SSE code
> for both the C and Ada programs.  In other words you
> see instructions like:
>    vmulsd  %xmm0, %xmm0, %xmm0
>    vaddsd  %xmm0, %xmm1, %xmm1

Those are AVX instructions, I think: the "v" prefix indicates the VEX encoding. (The SSE2 equivalents would be MULSD and ADDSD, as I understand it.)


> That won't help much if fetching memory from RAM is too slow
> to keep the multipliers busy. 
> 
> If you compile with the -mfpmath=387 switch, then no SSE code
> is generated, and the running time is about the same. (On my
> machine.)
> 
> When you compare programs in different languages, you need to
> write them the same. See below! I get identical run times from
> the two with all the compiler switches I try, as long as they
> are the same compiler switches. You can try various combinations
> of O2, O3, -mfpmath=387 etc:

Yeah. We are not really comparing languages here, though, but the instructions generated by the compilers. I'm sure that if we got GNAT to generate AVX2 or AVX512 instructions, then the performance would be the same as the AVX2 code generated by the MS C compiler.

We have to bear in mind, though, that there is limited reason to want a binary that specialised, because more often one wants a program to run on a variety of processors within the same family. Of course, a clever compiler could in theory compile several variants and choose between them at run time.

> 
>   gnatmake -O3 -march=native -funroll-loops map.adb
>   gcc -O3 -march=native -funroll-loops map.c
> 
> and remember to make room for the arrays on the stack; in the
> bash shell, that's ulimit -s unlimited. On Linux, timing
> with 'time ./a.out' and 'time ./map' works OK, but run them
> repeatedly, and stop any background processes (like browsers!)
> 
> #include <stdio.h>
> int main(void)
> {
>     int Calculation_Runs = 100;
>     int Data_Points = 320000000;
>     int i, j;
>     double s = 0.0;  /* must start at zero */
>     double v[Data_Points];
> 
>     for (i=0; i<Data_Points; i++){
>       v[i] = 3.14159265358979323846;
>     }
> 
>     for (j=0; j<Calculation_Runs; j++){
>         for (i=0; i<Data_Points; i++){
>           s = s + v[i] * v[i];
>         }
>     }
>     printf("Sum = %f\n",s);
>     return 0;
> }
> 
> with Ada.Text_IO; use Ada.Text_IO;
> procedure Map is
>    Calculation_Runs : constant := 100;
>    Data_Points : constant := 320_000_000;
> 
>    type Values_Index is range 1 .. Data_Points;
>    type Float64 is digits 15;
>    type Values_Array_Type is array (Values_Index) of Float64;
>    Values_Array : Values_Array_Type;
>    Sum : Float64 := 0.0;
> begin
>    for i in Values_Index loop
>       Values_Array (i) := 3.14159265358979323846;
>    end loop;
> 
>    for j in 1 .. Calculation_Runs loop
>       for i in Values_Index loop
>          Sum := Sum + Values_Array(i) * Values_Array(i);
>       end loop;
>    end loop;
>    Put_Line ("Sum = " & Sum'Image);
> end Map;

Note that there is no timing in either of those versions, so if you are using a shell timer (as in bash: "$ time ./myprog"), you are probably not getting good resolution, and, more importantly, you are timing other things besides the "map reduce" calculation, which I specifically wanted to exclude.

Cheers,
David

