From: Robert Eachus
Newsgroups: comp.lang.ada
Subject: Real tasking problems with Ada.
Date: Tue, 25 Jul 2017 16:19:57 -0700 (PDT)
Message-ID: <9e51f87c-3b54-4d09-b9ca-e3c6a6e8940a@googlegroups.com>

This may come across as a rant. I've tried to keep it down, but a little bit of ranting is probably appropriate. Some of these problems could have been fixed earlier, but there are some which are only coming of age with new CPU and GPU designs, notably the Zen family of CPUs from AMD.

Let me first explain where I am coming from (at least in this thread). I want to write code that takes full advantage of the features available to run code as fast as possible. In particular I'd like to get the time to run some embarrassingly parallel routines in less than twice the total CPU time of a single CPU. (In other words, a problem that takes T seconds (wall clock) on a single CPU should run in less than 2T/N wall clock time on N processors.) Oh, and it shouldn't generate garbage. I'm working on a test program, and it did generate garbage once when I didn't guess some of the system parameters right.

So what needs to be fixed? First, it should be possible to assign tasks in arrays to CPUs. With a half dozen CPU cores the current facilities are irksome. Oh, and when doing the assignment it would be nice to ask for the facilities you need, rather than writing code specific to each manufacturer's processor family. Just to start with, AMD has some processors which share floating-point units, so you want to run code on alternate CPU cores on those machines--if the tasks make heavy use of floating point.

Intel makes some processors with Hyperthreading, and some without, even within the same processor family. Hyperthreading does let you get some extra performance out if you know what you are doing, but much of the time you will want the Hyperthreading alternate to do background and OS processing while allocating your heavy-hitting work threads to the main threads.
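To make this concrete, the sketch below is roughly what pinning one heavy worker to every other hardware thread looks like with the current Ada 2012 facilities. It is only an illustration: Pin_Demo and Worker are made-up names, and the assumption that CPUs 1, 3, 5, and 7 land on distinct physical cores depends entirely on how the OS numbers its hardware threads.

   with System.Multiprocessors; use System.Multiprocessors;

   procedure Pin_Demo is
      --  One worker per physical core, skipping the sibling hardware
      --  threads so they stay free for OS and background work.
      task type Worker (Core : CPU) with CPU => Core;

      task body Worker is
      begin
         null;  --  the heavy floating-point work would go here
      end Worker;

      W1 : Worker (1);
      W3 : Worker (3);
      W5 : Worker (5);
      W7 : Worker (7);
   begin
      null;  --  the procedure cannot return until the workers terminate
   end Pin_Demo;

Notice that the workers have to be declared one by one: array components all share a single discriminant constraint, so you cannot declare an array of Worker where each element gets a different Core. That is exactly where task arrays with per-element CPU assignment would help.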
Now look at AMD's Zen family. Almost all available today have two threads per core like Intel's Hyperthreading, but these are much more like twins: with random loads, each thread will do about the same amount of work. However, if you know what you are doing, you can write code which usefully hogs all of a core's resources. Back to running on alternate cores...

I know that task array types were considered in Ada 9X. I don't know what happened to them. But even without them, two huge improvements would be:

1) Add a function Current_CPU or whatever (to System.Multiprocessors) that returns the identity of the CPU this task is running on. Obviously in a rendezvous with a protected object, the function would return the ID of the caller. Probably do the same thing in a rendezvous between two tasks, for consistency. Note that the Get_ID function in System.Multiprocessors.Dispatching_Domains does this, but it requires adding three (otherwise unnecessary) packages (Dispatching_Domains, Ada.Real_Time, and Ada.Task_Identification) to your context without really using anything there.

2) Allow a task to change its CPU assignment after it has started execution. It is no big deal if a task starts on a different CPU than the one it will spend the rest of its life on. At a minimum, Set_CPU (Current_CPU) or just Set_CPU should cause the task to be anchored to its current CPU core. Note that, again, you can do this with Dispatching_Domains.

Stretch goal: make it possible to assign tasks to a specific pair of threads. In theory Dispatching_Domains does this, but the environment task messes things up a bit: you need to leave the partner of the environment task's CPU core in the default dispatching domain, and there is no guarantee that the environment task is running on CPU 1 (or CPU 0, the way the hardware numbers them).

Next, a huge problem. I just had some code churn out garbage while I was finding the "right" settings to get each chunk of work to have its own portion of an array. Don't tell me how to do this safely; if you do, you are missing the point. If each cache line is only written to by one task, that should be safe. But to do that I need to determine the size of the cache lines, and how to force the compiler to allocate the data in the array beginning on a cache-line boundary. The second part is not hard, except that the compiler may not support alignment clauses that large. The first? A function Cache_Line_Size in System or System.Multiprocessors seems right. Whether it is in bits or storage_units is no big deal. Why a function and not a constant? The future looks like a mix of CPUs and GPUs all running parts of the same program.

Finally, caches and NUMA galore. I mentioned AMD's Zen above. Right now there are three Zen families with very different system architectures. In fact, the difference between single- and dual-socket Epyc makes it four, and the Ryzen 3 and Ryzen APUs when released? At least one more. What's the big deal? Take Threadripper to begin with: your choice of 12 or 16 cores, each supporting two threads. But the cache hierarchy is complex. Each CPU core has two threads and its own L1 and L2 caches. Then 3 or 4 cores, depending on the model, share the same 8 MB L3 cache. The four blocks of CPU cores and caches are actually split between two different chips. That doesn't affect the cache timings much, but half the memory is attached to one chip and half to the other, and memory loads and stores to the other chip compete with L3 and cache-probe traffic. Let's condense that to this: 2 (threads) / (3 or 4) (cores) / 2 (NUMA pairs) / 2 (chips) / 1 (socket). A Ryzen 7 chip is 2/4/2/1/1, Ryzen 5 is 2/(3 or 2)/2/1/1, and Ryzen 3 is 1/2/2/1/1. Epyc comes in at 2/(3 or 4)/2/4/2, among other flavors.
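For what it's worth, here is a sketch of one way a program might carry that condensed description around. The package and component names are my own invention, not an existing library; the Ryzen_7 constant is just the 2/4/2/1/1 line restated.

   package CPU_Topology is
      type Description is record
         Threads_Per_Core  : Positive := 1;  --  SMT / Hyperthreading width
         Cores_Per_Cluster : Positive := 1;  --  cores sharing one L3 (a CCX)
         Clusters_Per_Chip : Positive := 1;  --  the "NUMA pairs" above
         Chips_Per_Socket  : Positive := 1;
         Sockets           : Positive := 1;
      end record;

      --  Ryzen 7 in the 2/4/2/1/1 notation used above.
      Ryzen_7 : constant Description :=
        (Threads_Per_Core  => 2,
         Cores_Per_Cluster => 4,
         Clusters_Per_Chip => 2,
         Chips_Per_Socket  => 1,
         Sockets           => 1);

      function Hardware_Threads (D : Description) return Positive is
        (D.Threads_Per_Core * D.Cores_Per_Cluster * D.Clusters_Per_Chip *
         D.Chips_Per_Socket * D.Sockets);
   end CPU_Topology;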
Writing a package to recognize these models and sort this out for executable programs is going to be a non-trivial exercise--at least if I try to keep it current. But how to convey the information to the software which is going to try to saturate the system? There is no point in creating tasks which won't have a CPU core of their own (or half a core, or whatever you count Hyperthreading as). Given the size of some of these systems, even without the HPC environment, it may be better for a program to split the work between chips or boxes.

Is adding these features to Ada worth the effort? Sure. Let me give you a very realistic example. Running on processor cores which share L3 cache may be worthwhile. Actually with Zen, the difference is that a program that stays on one L3 cache will save a lot of time on L2 probes. (The line you need is in L2 on another CPU core. Moving it to your core will take less time, and more importantly less latency, than moving it from another CPU cluster.) So we go to write our code. On Ryzen 7 we want to run on cores 1, 3, 5, and 7, or 9, 11, 13, and 15, or 2, 4, 6, 8, or... Actually I could choose 1, 4, 6, and 7: any set of one core from each pair, staying within the module (of eight threads). Move to a low-end Ryzen 3, and I get almost the same performance by choosing all the available cores: 1, 2, 3, and 4. What about Ryzen 5 1600 and 1600X? Is it going to be better to run on 3 cores and one L3 cache, or 4 cores spread across two caches? Or maybe choose all six cores on one L3 cache? Argh!

Is this problem real? I just took a program from 7.028 seconds on six cores to 2.229 seconds on (the correct) three cores. I'll post the program, or put it on-line somewhere, once I've confined the memory corruption to very small examples--so you can see which machines do it--and do a bit more cleanup and optimization.
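Going back to the cache-line problem above, here is the kind of declaration I would like to be able to write portably. It is only a sketch: there is no Cache_Line_Size in System today, so the 64-storage-unit figure is hard-wired (and, as noted, a compiler is free to reject an Alignment that large); it also assumes Long_Float takes 8 storage units.

   with System.Multiprocessors; use System.Multiprocessors;

   procedure Chunked is
      Cache_Line : constant := 64;  --  a guess; right for current x86 parts

      type Chunk is record
         Partial_Sum : Long_Float := 0.0;
         Padding     : String (1 .. Cache_Line - 8) := (others => ' ');
      end record
        with Alignment => Cache_Line;
      --  Each element is padded out to a full cache line, so a task that
      --  writes only its own element never shares a line with another task.

      type Chunk_Array is array (CPU range 1 .. Number_Of_CPUs) of Chunk
        with Independent_Components;

      Results : Chunk_Array;
   begin
      null;  --  worker task I would write only Results (I)
   end Chunked;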