Newsgroups: comp.lang.ada
Subject: Re: GNAT can't vectorize Real_Matrix multiplication from Ada.Numerics.Real_Arrays. What a surprise!
From: Bojan Bozovic
Date: Mon, 19 Feb 2018 18:31:24 -0800 (PST)
Message-ID: <06efbe02-cdae-4fac-a17d-6d0c1be7848c@googlegroups.com>
References: <83493d20-7001-405b-8658-8a3f5d6c90fa@googlegroups.com>

On Monday, February 19, 2018 at 10:08:41 PM UTC+1, Robert Eachus wrote:
> On Sunday, February 18, 2018 at 4:48:42 PM UTC-5, Nasser M. Abbasi wrote:
> > On 2/18/2018 1:38 PM, Bojan Bozovic wrote:
> >
> > If you are doing A*B by hand, then you are doing something
> > wrong. Almost all languages end up calling BLAS
> > Fortran libraries for these operations. Your code and
> > the Ada code can't be faster.
> >
> > http://www.netlib.org/blas/
> >
> > Intel Math Kernel Library has all these.
> >
> > https://en.wikipedia.org/wiki/Math_Kernel_Library
>
> For multiplying two small matrices, BLAS is overkill and will be slower. If you have, say, 1000x1000 matrices, then you should be using BLAS. But which BLAS? Intel and AMD both have math libraries optimized for their CPUs. However, I tend to use ATLAS. ATLAS will build a BLAS targeted at your specific hardware. This is not just about instruction set additions like SIMD2. It will tailor the implementation to your number of cores and supported threads, cache sizes, and memory speeds. I've also used the Goto BLAS, but ATLAS, even though not perfect, builds all of BLAS3 using matrix multiplication and BLAS2, such that all operations slower than O(n^2) have their speed determined by matrix multiplication. (It then uses multiple matrix multiplication codes with different parameters to find the fastest.)
>
> Usually hardware vendor libraries catch up to and surpass ATLAS, but by then the hardware is obsolete. :-( The other problem right now is that BLAS libraries are pretty dumb when it comes to multiprocessor systems. I'm working on fixing that.
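Just so we are all talking about the same operation: the product in the subject line is the one the standard library already provides. Here is a toy sketch of mine (sizes and values are arbitrary, and whether "*" ends up vectorized or bound to an external BLAS depends entirely on how your runtime was built):

   with Ada.Numerics.Real_Arrays; use Ada.Numerics.Real_Arrays;
   with Ada.Text_IO;              use Ada.Text_IO;

   procedure Multiply_Demo is
      N : constant := 4;
      A : constant Real_Matrix (1 .. N, 1 .. N) := (others => (others => 1.0));
      B : constant Real_Matrix (1 .. N, 1 .. N) := (others => (others => 2.0));
      C : Real_Matrix (1 .. N, 1 .. N);
   begin
      C := A * B;   --  the predefined product from Ada.Numerics.Real_Arrays
      Put_Line ("C (1, 1) =" & Float'Image (C (1, 1)));
   end Multiply_Demo;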
;-) I have looked at ATLAS; however, it can't spawn more threads than were specified at compile time, so there is plenty of room to optimize there by spawning as many threads as the machine supports at run time. Ada would do much better here than C, because you could write portable code, without resorting to the ugly hacks C needs, and use parallelism no matter what the underlying processor architecture is. That's my $0.02, worthless or not (and if you want to use assembler to "optimize" further, as is done in C, that can be done from any language, which I fear the Intel MKL and other vendor libraries do).
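To make the run-time part concrete, here is a rough sketch of my own (not ATLAS and not anyone's library code): System.Multiprocessors.Number_Of_CPUs is queried when the program runs, and one task per reported CPU fills in a band of rows of the product. The sizes, the row-band split, and the naive inner loop are assumptions purely for illustration:

   with Ada.Numerics.Real_Arrays; use Ada.Numerics.Real_Arrays;
   with Ada.Text_IO;              use Ada.Text_IO;
   with System.Multiprocessors;   use System.Multiprocessors;

   procedure Parallel_Multiply is
      N : constant := 512;

      A : constant Real_Matrix (1 .. N, 1 .. N) := (others => (others => 1.0));
      B : constant Real_Matrix (1 .. N, 1 .. N) := (others => (others => 2.0));
      C : Real_Matrix (1 .. N, 1 .. N);

      --  Number_Of_CPUs is evaluated at run time, so the task count
      --  adapts to whatever machine the program is running on.
      Workers : constant Positive := Positive (Number_Of_CPUs);
      Band    : constant Positive := (N + Workers - 1) / Workers;

      task type Worker is
         entry Start (First, Last : Integer);
      end Worker;

      task body Worker is
         Lo, Hi : Integer;
      begin
         accept Start (First, Last : Integer) do
            Lo := First;
            Hi := Last;
         end Start;
         --  Plain triple loop over this task's band of rows; a real
         --  implementation would block for cache the way ATLAS does.
         for I in Lo .. Hi loop
            for J in 1 .. N loop
               declare
                  Sum : Float := 0.0;
               begin
                  for K in 1 .. N loop
                     Sum := Sum + A (I, K) * B (K, J);
                  end loop;
                  C (I, J) := Sum;
               end;
            end loop;
         end loop;
      end Worker;

   begin
      declare
         Pool : array (1 .. Workers) of Worker;
      begin
         for W in Pool'Range loop
            Pool (W).Start
              (First => (W - 1) * Band + 1,
               Last  => Integer'Min (W * Band, N));
         end loop;
      end;  --  the block waits here until every Worker has terminated
      Put_Line ("C (1, 1) =" & Float'Image (C (1, 1)));
   end Parallel_Multiply;

A real library would of course block for cache and vectorize the inner loop, which is exactly the part ATLAS tunes; the point is only that the worker count does not have to be frozen at compile time.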