From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: Paul Rubin <no.email@nospam.invalid>
Newsgroups: comp.lang.ada
Subject: Re: Toy computational "benchmark" in Ada (new blog post)
Date: Thu, 06 Jun 2019 21:50:04 -0700
Organization: A noiseless patient Spider
Message-ID: <877e9yxamb.fsf@nightsong.com>
References: <55b14350-e255-406c-ab11-b824da77995b@googlegroups.com>
	<qdbt6v$7qa$1@dont-email.me>
	<6776b034-1318-49b3-8ff5-5a2f746fac9c@googlegroups.com>
	<87blzaxnei.fsf@nightsong.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org;
 posting-host="b0f0f522fb2e31df244925cac2903ee1";
	logging-data="11348"; mail-complaints-to="abuse@eternal-september.org";
	posting-account="U2FsdGVkX181fGcolt/MCaqy/0p5f1+B"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux)
Cancel-Lock: sha1:K4XiP2QVgM6GjWz3a8btzsigcC0=
	sha1:RIAfe4CJnK5uqCZC6Sug3lKWj0M=
Xref: reader01.eternal-september.org comp.lang.ada:56518
Date: 2019-06-06T21:50:04-07:00
List-Id: <comp.lang.ada>

Paul Rubin <no.email@nospam.invalid> writes:
> Actual elapsed times were 2 min 5 sec for the single threaded [Ada]
> version and 36.484s for the parallel version.  The reported usermode
> cpu times were 2m3.2s for the single threaded version and 2m18s
> (across the 4 cores) for the parallel version.
>
> I'll see if I can code and time a C++ version.

C++ version results: single threaded 45.73 seconds cpu, 47.36 sec elapsed.
Multi-threaded (4 threads): 84.34 sec cpu, 23.11 sec elapsed.

That's using GCC 6.03 with threading done by std::async, which returns
a std::future for each task.  So both versions are a fair amount faster
than the Ada version, but the threading speedup is nowhere near as good:
about 2.0x in elapsed time (47.36 / 23.11) versus about 3.4x for Ada
(125 / 36.484).  I wonder what's going on with that.  At each of the 50
calculation runs, I launch 4 threads for slices of the array, then wait
for them to complete and sum the results, so there might be a little bit
of idle time if the threads don't all use the exact same amount of time.
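
For anyone following along, the launch-and-wait pattern for one run
looks roughly like the sketch below.  This is not the benchmark code:
sum_slice is a hypothetical stand-in for the real per-slice
calculation, parallel_sum is a made-up name, and the explicit
std::launch::async policy is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical per-slice work: sum the elements in [first, last).
// The real benchmark does a heavier calculation per element.
double sum_slice(const std::vector<double>& data,
                 std::size_t first, std::size_t last) {
    return std::accumulate(data.begin() + first, data.begin() + last, 0.0);
}

// Split data into num_tasks contiguous slices, start one std::async
// task per slice, then wait for every task and add up the partial sums.
double parallel_sum(const std::vector<double>& data,
                    std::size_t num_tasks = 4) {
    assert(num_tasks > 0);
    std::vector<std::future<double>> parts;
    parts.reserve(num_tasks);
    const std::size_t n = data.size();
    for (std::size_t t = 0; t < num_tasks; ++t) {
        const std::size_t first = n * t / num_tasks;
        const std::size_t last = n * (t + 1) / num_tasks;
        // std::launch::async asks for a new thread rather than deferred
        // execution (assumed here; the default policy lets the library choose).
        parts.push_back(std::async(std::launch::async, [&data, first, last] {
            return sum_slice(data, first, last);
        }));
    }
    double total = 0.0;
    for (auto& part : parts) {
        total += part.get();  // blocks until that slice has finished
    }
    return total;
}
```

Calling parallel_sum(v) once per run gives one join point per run, so
each run takes as long as its slowest slice.  On Linux with older GCC
this typically needs -pthread at link time.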

I also want to try std::transform_reduce (C++17) and maybe the new Intel
TBB library with GCC 9.  It is possible that transform_reduce can use
SIMD instructions by itself; otherwise the Intel SIMD intrinsics can be
called directly with some hassle.  Also I can run on a machine with
AVX512.
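
The C++17 standard algorithm for this kind of map-then-sum is
std::transform_reduce.  A minimal sketch of its sequential form is
below; work is a hypothetical stand-in for the per-element calculation
and sum_of_work is a made-up name.  The parallel overloads take an
execution policy from <execution> as an extra first argument, which is
not shown here since it depends on the compiler and library setup.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical per-element work standing in for the benchmark's
// calculation.
double work(double x) { return x * x; }

// Apply work() to every element and add up the results in one pass,
// starting from 0.0 (requires C++17).
double sum_of_work(const std::vector<double>& data) {
    return std::transform_reduce(data.begin(), data.end(), 0.0,
                                 std::plus<>(), work);
}
```

For the input {1, 2, 3} this returns 1 + 4 + 9 = 14.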

I guess the next thing after that would be OpenCL or CUDA and a graphics
card.