comp.lang.ada
From: Robert Eachus <rieachus@comcast.net>
Subject: Real tasking problems with Ada.
Date: Tue, 25 Jul 2017 16:19:57 -0700 (PDT)
Message-ID: <9e51f87c-3b54-4d09-b9ca-e3c6a6e8940a@googlegroups.com>

This may come across as a rant.  I've tried to keep it down, but a little ranting is probably appropriate.  Some of these problems could have been fixed earlier, but others are only now coming to a head with new CPU and GPU designs, notably the Zen family of CPUs from AMD.

Let me first explain where I am coming from (at least in this thread).  I want to write code that takes full advantage of the features available, to run as fast as possible.  In particular, I'd like the time to run some embarrassingly parallel routines to be less than twice the total CPU time of a single-CPU run.  (In other words, a problem that takes T seconds of wall-clock time on a single CPU should run in less than 2T/N wall-clock time on N processors.)  Oh, and it shouldn't generate garbage.  I'm working on a test program, and it did generate garbage once, when I didn't guess some of the system parameters right.

So what needs to be fixed?  First, it should be possible to assign arrays of tasks to CPUs; with more than a half-dozen CPU cores, the current facilities are irksome.  And when doing the assignment, it would be nice to ask for the facilities you need, rather than writing separate code for each manufacturer's processor family.  Just to start with, AMD has some processors in which pairs of cores share a floating-point unit, so on those machines you want to run on alternate CPU cores--if the tasks make heavy use of floating point.
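
Here is a minimal sketch of what per-CPU assignment looks like with today's Ada 2012 CPU aspect (all names below are illustrative).  The discriminant trick works, but you end up going through allocators, because the elements of a plain task array cannot each have a different discriminant:

   with System.Multiprocessors; use System.Multiprocessors;

   procedure Pin_Workers is

      task type Worker (Core : CPU) with CPU => Core;

      task body Worker is
      begin
         null;  --  the real per-core work goes here
      end Worker;

      type Worker_Ptr is access Worker;

      --  One worker per core; each needs its own discriminant value.
      Workers : array (CPU range 1 .. Number_Of_CPUs) of Worker_Ptr;

   begin
      for C in Workers'Range loop
         Workers (C) := new Worker (Core => C);
      end loop;
   end Pin_Workers;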

Intel makes some processors with Hyperthreading, and some without, even within the same processor family.  Hyperthreading does let you squeeze out some extra performance if you know what you are doing, but much of the time you will want the Hyperthreading alternates to do background and OS processing while allocating your heavy-hitting work threads to the main threads.

Now look at AMD's Zen family.  Almost all of those available today have two threads per core, like Intel's Hyperthreading, but the two threads are much more like twins.  With random loads, each thread will do about the same amount of work.  However, if you know what you are doing, you can write code which usefully hogs all of a core's resources.  Back to running on alternate cores...
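
For the alternate-cores case, the best you can do today is guess at the numbering.  A sketch, assuming the two hardware threads of a core get consecutive CPU numbers (true on many systems, but nothing in the language promises it):

   with System.Multiprocessors; use System.Multiprocessors;

   procedure One_Thread_Per_Core is

      task type FP_Worker (Core : CPU) with CPU => Core;

      task body FP_Worker is
      begin
         null;  --  floating-point-heavy work goes here
      end FP_Worker;

      type FP_Worker_Ptr is access FP_Worker;

      Workers : array (1 .. Natural (Number_Of_CPUs) / 2) of FP_Worker_Ptr;

   begin
      --  Odd-numbered CPUs only: one worker per physical core, leaving
      --  the even-numbered siblings for the OS and background work.
      for I in Workers'Range loop
         Workers (I) := new FP_Worker (Core => CPU (2 * I - 1));
      end loop;
   end One_Thread_Per_Core;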

I know that task array types were considered in Ada 9X.  I don't know what happened to them.  But even without them, two huge improvements would be:

   1) Add a function Current_CPU (or whatever) to System.Multiprocessors that returns the identity of the CPU the calling task is currently running on.  Obviously, in a call on a protected object the function would return the CPU of the calling task; probably do the same thing in a rendezvous between two tasks, for consistency.  Note that the Get_CPU function in System.Multiprocessors.Dispatching_Domains does roughly that, but it requires adding three (otherwise unnecessary) packages (Dispatching_Domains, Ada.Real_Time, and Ada.Task_Identification) to your context without really using anything else there.

   2) Allow a task to change its CPU assignment after it has started execution.  It is no big deal if a task starts on a different CPU than the one it will spend the rest of its life on.  At a minimum, Set_CPU (Current_CPU), or just a parameterless Set_CPU, should cause the task to be anchored to its current CPU core.  Note that, again, you can do this with Dispatching_Domains; see the sketch after this list.

   Stretch goal:  Make it possible to assign tasks to a specific pair of threads. In theory Dispatching_Domains does this, but the environment task messes things up a bit.  You need to leave the partner of the environment task's CPU core in the default dispatching domain.  The problem is that there is no guarantee that the environment task is running on CPU 1 (or CPU 0, the way the hardware numbers them).
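
A minimal sketch of the existing route for point 2 (Pin_Here is a made-up helper name).  What is still missing is a portable way to ask which CPU the task is executing on right now, so that the pin can be "wherever I already am":

   with System.Multiprocessors;
   with System.Multiprocessors.Dispatching_Domains;

   procedure Pin_Here (Target : System.Multiprocessors.CPU) is
      use System.Multiprocessors.Dispatching_Domains;
   begin
      --  Set_CPU defaults to the calling task (Current_Task), so this
      --  anchors whichever task calls it to Target from here on.
      Set_CPU (Target);
   end Pin_Here;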

Next, a huge problem.  I just had some code churn out garbage while I was finding the "right" settings to give each chunk of work its own portion of an array.  Don't tell me how to do this safely; if you do, you are missing the point.  If each cache line is only written to by one task, that should be safe.  But to do that I need to determine the size of the cache lines, and how to force the compiler to allocate the data in the array beginning on a cache-line boundary.  The second part is not hard, except that the compiler may not support alignment clauses that large.  The first?  A function Cache_Line_Size in System or System.Multiprocessors seems right.  Whether it is in bits or in Storage_Units is no big deal.  Why a function and not a constant?  Because the future looks like a mix of CPUs and GPUs all running parts of the same program.
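
Here is the shape of the safe layout, as a sketch: it assumes a 64-byte cache line and an 8-byte Long_Float, and all the names are illustrative.  The hard-coded constant is exactly what a Cache_Line_Size function would replace:

   with System.Multiprocessors; use System.Multiprocessors;

   package Chunked_Data is

      --  Assumed cache-line size; there is no portable way to ask for
      --  it, which is the gap described above.
      Cache_Line_Bytes  : constant := 64;
      Elements_Per_Line : constant := Cache_Line_Bytes / 8;  --  8-byte Long_Float assumed

      --  Each task writes only its own chunk, and every chunk is a whole
      --  number of cache lines starting on a line boundary, so no cache
      --  line is ever written by two tasks.
      Chunk_Lines : constant := 1_024;
      Chunk_Size  : constant := Chunk_Lines * Elements_Per_Line;

      type Data_Array is array (Natural range <>) of Long_Float
        with Alignment => Cache_Line_Bytes;  --  the compiler may not support alignments this large

      Data : Data_Array (0 .. Natural (Number_Of_CPUs) * Chunk_Size - 1);

      --  Index range owned by the worker pinned to Core.
      function First_Of (Core : CPU) return Natural is
        (Natural (Core - 1) * Chunk_Size);
      function Last_Of (Core : CPU) return Natural is
        (Natural (Core) * Chunk_Size - 1);

   end Chunked_Data;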

Finally, caches and NUMA galore.  I mentioned AMD's Zen above.  Right now there are three Zen families with very different system architectures.  In fact, the difference between single- and dual-socket Epyc makes it four, and the Ryzen 3 and Ryzen APUs, when released?  At least one more.  What's the big deal?  Take Threadripper to begin with.  Your choice of 12 or 16 cores, each supporting two threads.  But the cache hierarchy is complex.  Each CPU core has two threads and its own L1 and L2 caches.  Then 3 or 4 cores, depending on the model, share the same 8 MB L3 cache.  The four blocks of CPU cores and caches are actually split between two different chips.  That doesn't affect the cache timings much, but half the memory is attached to one chip and half to the other, and memory loads and stores to the other chip compete with L3 and cache-probe traffic.  Let's condense that to this: 2 (threads)/(3 or 4) (cores)/2 (NUMA pairs)/2 (chips)/1 (socket).  A Ryzen 7 chip is 2/4/2/1/1, Ryzen 5 is 2/(3 or 2)/2/1/1, and Ryzen 3 is 1/2/2/1/1.  Epyc comes in at 2/(3 or 4)/2/4/2, among other flavors.

Writing a package to recognize these models and sort out the right configuration for executable programs is going to be a non-trivial exercise--at least if I try to keep it current.  But how do we convey the information to the software which is going to try to saturate the system?  There is no point in creating tasks which won't have a CPU core of their own (or half a core, or however you count Hyperthreading).  Given the size of some of these systems, even outside the HPC environment, it may be better for a program to split the work between chips or boxes.
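
Something like the following is the kind of answer such a package would have to produce.  Everything here is hypothetical: the placeholder body is shaped like a Ryzen 7 (2/4/2/1/1), and a real Detect would need per-vendor CPUID or OS queries behind it:

   package CPU_Topology is

      type Topology is record
         Threads_Per_Core : Positive;  --  2 with SMT/Hyperthreading, else 1
         Cores_Per_L3     : Positive;  --  3 or 4 on current Zen parts
         L3s_Per_Die      : Positive;  --  the "NUMA pairs" above
         Dies_Per_Socket  : Positive;  --  2 for Threadripper, 4 for Epyc
         Sockets          : Positive;
      end record;

      function Detect return Topology;

   private

      --  Placeholder answer only; keeping real detection logic current
      --  across processor families is the non-trivial part.
      function Detect return Topology is
        (Threads_Per_Core => 2, Cores_Per_L3 => 4, L3s_Per_Die => 2,
         Dies_Per_Socket  => 1, Sockets => 1);

   end CPU_Topology;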

Is adding these features to Ada worth the effort?  Sure.  Let me give you a very realistic example.  Running only on processor cores which share an L3 cache may be worthwhile.  With Zen, the difference is that a program which stays on one L3 cache will save a lot of time on L2 probes.  (When the line you need is in the L2 of another CPU core, moving it to your core takes less time, and more importantly less latency, than moving it from another CPU cluster.)  So we go to write our code.  On Ryzen 7 we want to run on CPUs 1, 3, 5, and 7, or 9, 11, 13, and 15, or 2, 4, 6, and 8, or...  Actually I could choose 1, 4, 6, and 7: any set of one thread from each core, staying within the module (of eight threads).  Move to a low-end Ryzen 3, and I get almost the same performance by choosing all the available cores: 1, 2, 3, and 4.  What about a Ryzen 5 1600 or 1600X?  Is it going to be better to run on three cores and one L3 cache, or four cores spread across two caches?  Or maybe use all six cores, split across the two L3 caches?  Argh!
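
With today's Ada, the tool for "stay on one L3" is a dispatching domain created at elaboration time.  A sketch (two compilation units), assuming a 16-thread Ryzen 7 where CPUs 9 through 16 are the second CCX and the environment task ends up somewhere in 1 through 8--which, as noted above, is exactly what you cannot count on:

   with System.Multiprocessors.Dispatching_Domains;
   use  System.Multiprocessors.Dispatching_Domains;

   package One_CCX is
      --  Created during library-unit elaboration, before the main
      --  subprogram runs; the environment task's CPU has to stay in the
      --  system dispatching domain, so we cannot hand over all sixteen.
      Second_CCX : Dispatching_Domain := Create (9, 16);
   end One_CCX;

   with One_CCX;

   procedure Saturate_One_L3 is

      task type Worker with Dispatching_Domain => One_CCX.Second_CCX;

      task body Worker is
      begin
         null;  --  work confined to the second CCX's eight threads
      end Worker;

      Crew : array (1 .. 8) of Worker;

   begin
      null;  --  the Crew runs to completion before we leave
   end Saturate_One_L3;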

Is this problem real?  I just took a program from 7.028 seconds on six cores to 2.229 seconds on (the correct) three cores.  I'll post the program, or put it on-line somewhere, once I've confined the memory corruption to very small examples--so you can see which machines do it--and done a bit more cleanup and optimization.
