From mboxrd@z Thu Jan 1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level:
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=unavailable autolearn_force=no version=3.4.4
Path: eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: Paul Rubin
Newsgroups: comp.lang.ada
Subject: Re: Toy computational "benchmark" in Ada (new blog post)
Date: Thu, 06 Jun 2019 21:50:04 -0700
Organization: A noiseless patient Spider
Message-ID: <877e9yxamb.fsf@nightsong.com>
References: <55b14350-e255-406c-ab11-b824da77995b@googlegroups.com> <6776b034-1318-49b3-8ff5-5a2f746fac9c@googlegroups.com> <87blzaxnei.fsf@nightsong.com>
Mime-Version: 1.0
Content-Type: text/plain
Injection-Info: reader02.eternal-september.org; posting-host="b0f0f522fb2e31df244925cac2903ee1"; logging-data="11348"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX181fGcolt/MCaqy/0p5f1+B"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux)
Cancel-Lock: sha1:K4XiP2QVgM6GjWz3a8btzsigcC0= sha1:RIAfe4CJnK5uqCZC6Sug3lKWj0M=
Xref: reader01.eternal-september.org comp.lang.ada:56518

Paul Rubin writes:
> Actual elapsed times were 2 min 5 sec for the single threaded [Ada]
> version and 36.484s for the parallel version. The reported usermode
> cpu times were 2m3.2s for the single threaded version and 2m18s
> (across the 4 cores) for the parallel version.
>
> I'll see if I can code and time a C++ version.

C++ version results: single threaded 45.73 seconds cpu, 47.36 sec
elapsed. Multi-threaded (4 threads): 84.34 sec cpu, 23.11 sec elapsed.
That's using GCC 6.03 with threading done by std::future's async
function.

So both versions are a fair amount faster than the Ada version, but the
threading speedup is nowhere near as good. I wonder what's going on
with that.
At each of the 50 calculation runs, I launch 4 threads for slices of the array, then wait for them to complete and sum the results, so there might be a little bit of idle time if the threads don't all take exactly the same amount of time.

I also want to try transform_reduce and maybe the new Intel TBB library with GCC 9. It is possible that transform_reduce can use Intel SIMD instructions by itself; if not, the intrinsics can be called directly with some hassle. I can also run on a machine with AVX512. I guess the next thing after that would be OpenCL or CUDA and a graphics card.
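
For anyone who wants to try the same thing, here is a rough sketch of the slice-per-thread scheme described above: one std::async task per slice, then wait on the futures and add up the partial results. This is not the benchmark code itself. The names work, slice_sum and parallel_sum are made up for illustration, and work() is only a placeholder for the real per-element calculation, which isn't shown in this post.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Placeholder for the real per-element calculation (an assumption;
// the actual benchmark kernel is not shown in the post).
double work(double x) { return x * x; }

// Sequentially sum work(x) over data[begin, end).
double slice_sum(const std::vector<double>& data,
                 std::size_t begin, std::size_t end) {
    double s = 0.0;
    for (std::size_t i = begin; i < end; ++i) s += work(data[i]);
    return s;
}

// Split data into n_threads contiguous slices, run each slice in its
// own std::async task, then wait for all of them and sum the results.
double parallel_sum(const std::vector<double>& data, std::size_t n_threads) {
    if (n_threads == 0) n_threads = 1;  // avoid dividing by zero below
    const std::size_t n = data.size();

    std::vector<std::future<double>> parts;
    parts.reserve(n_threads);
    for (std::size_t t = 0; t < n_threads; ++t) {
        const std::size_t begin = n * t / n_threads;
        const std::size_t end = n * (t + 1) / n_threads;
        // std::launch::async forces a real thread instead of deferred
        // execution on the calling thread.
        parts.push_back(std::async(std::launch::async, [&data, begin, end] {
            return slice_sum(data, begin, end);
        }));
    }

    double total = 0.0;
    for (auto& f : parts) total += f.get();  // blocks until that slice is done
    return total;
}
```

Calling parallel_sum(data, 4) once per run matches the pattern in the post. The loop over f.get() is where the idle time shows up: the caller can't finish until the slowest slice does, so uneven slice times leave the other cores waiting.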