From: Robert Eachus
Newsgroups: comp.lang.ada
Subject: Real tasking problems with Ada.
Date: Tue, 25 Jul 2017 16:19:57 -0700 (PDT)
Message-ID: <9e51f87c-3b54-4d09-b9ca-e3c6a6e8940a@googlegroups.com>

This may come across as a rant. I've tried to keep it down, but a little bit of ranting is probably appropriate. Some of these problems could have been fixed earlier, but there are some which are only coming of age with new CPU and GPU designs, notably the Zen family of CPUs from AMD.

Let me first explain where I am coming from (at least in this thread). I want to write code that takes full advantage of the features available to run code as fast as possible. In particular I'd like to get the time to run some embarrassingly parallel routines in less than twice the total CPU time of a single CPU. (In other words, a problem that takes T seconds (wall clock) on a single CPU should run in less than 2T/N wall clock time on N processors.) Oh, and it shouldn't generate garbage. I'm working on a test program, and it did generate garbage once when I didn't guess some of the system parameters right.

So what needs to be fixed? First, it should be possible to assign tasks in arrays to CPUs. With a half dozen CPU cores the current facilities are irksome. Oh, and when doing the assignment it would be nice to ask for the facilities you need, rather than writing code specific to each manufacturer's processor family. Just to start with, AMD has some processors which share floating-point units, so you want to run code on alternate CPU cores on those machines--if the tasks make heavy use of floating point.

Intel makes some processors with Hyperthreading, and some without, even within the same processor family. Hyperthreading does let you get some extra performance out if you know what you are doing, but much of the time you will want the Hyperthreading alternate to do background and OS processing while allocating your heavy-hitting work threads to the main threads.
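To make this concrete, the sketch below is roughly what pinning one heavy worker to every other hardware thread looks like with the current Ada 2012 facilities. It is only an illustration: Pin_Demo and Worker are made-up names, and the assumption that CPUs 1, 3, 5, and 7 land on distinct physical cores depends entirely on how the OS numbers its hardware threads.

   with System.Multiprocessors; use System.Multiprocessors;

   procedure Pin_Demo is
      --  One worker per physical core, skipping the sibling hardware
      --  threads so they stay free for OS and background work.
      task type Worker (Core : CPU) with CPU => Core;

      task body Worker is
      begin
         null;  --  the heavy floating-point work would go here
      end Worker;

      W1 : Worker (1);
      W3 : Worker (3);
      W5 : Worker (5);
      W7 : Worker (7);
   begin
      null;  --  the procedure cannot return until the workers terminate
   end Pin_Demo;

Notice that the workers have to be declared one by one: array components all share a single discriminant constraint, so you cannot declare an array of Worker where each element gets a different Core. That is exactly where task arrays with per-element CPU assignment would help.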
Now look at AMD's Zen family. Almost all available today have two threads per core like Intel's Hyperthreading, but these are much more like twins: with random loads, each thread will do about the same amount of work. However, if you know what you are doing, you can write code which usefully hogs all of a core's resources. Back to running on alternate cores...

I know that task array types were considered in Ada 9X. I don't know what happened to them. But even without them, two huge improvements would be:

1) Add a function Current_CPU or whatever (to System.Multiprocessors) that returns the identity of the CPU this task is running on. Obviously in a rendezvous with a protected object, the function would return the ID of the caller. Probably do the same thing in a rendezvous between two tasks, for consistency. Note that the Get_ID function in System.Multiprocessors.Dispatching_Domains does this, but it requires adding three (otherwise unnecessary) packages (Dispatching_Domains, Ada.Real_Time, and Ada.Task_Identification) to your context without really using anything there.

2) Allow a task to change its CPU assignment after it has started execution. It is no big deal if a task starts on a different CPU than the one it will spend the rest of its life on. At a minimum, Set_CPU (Current_CPU) or just Set_CPU should cause the task to be anchored to its current CPU core. Note that, again, you can do this with Dispatching_Domains.

Stretch goal: make it possible to assign tasks to a specific pair of threads. In theory Dispatching_Domains does this, but the environment task messes things up a bit: you need to leave the partner of the environment task's CPU core in the default dispatching domain, and there is no guarantee that the environment task is running on CPU 1 (or CPU 0, the way the hardware numbers them).

Next, a huge problem. I just had some code churn out garbage while I was finding the "right" settings to get each chunk of work to have its own portion of an array. Don't tell me how to do this safely; if you do, you are missing the point. If each cache line is only written to by one task, that should be safe. But to do that I need to determine the size of the cache lines, and how to force the compiler to allocate the data in the array beginning on a cache-line boundary. The second part is not hard, except that the compiler may not support alignment clauses that large. The first? A function Cache_Line_Size in System or System.Multiprocessors seems right. Whether it is in bits or storage_units is no big deal. Why a function and not a constant? The future looks like a mix of CPUs and GPUs all running parts of the same program.

Finally, caches and NUMA galore. I mentioned AMD's Zen above. Right now there are three Zen families with very different system architectures. In fact, the difference between single- and dual-socket Epyc makes it four, and the Ryzen 3 and Ryzen APUs when released? At least one more. What's the big deal? Take Threadripper to begin with: your choice of 12 or 16 cores, each supporting two threads. But the cache hierarchy is complex. Each CPU core has two threads and its own L1 and L2 caches. Then 3 or 4 cores, depending on the model, share the same 8 MB L3 cache. The four blocks of CPU cores and caches are actually split between two different chips. That doesn't affect the cache timings much, but half the memory is attached to one chip and half to the other, and memory loads and stores to the other chip compete with L3 and cache-probe traffic. Let's condense that to this: 2 (threads) / (3 or 4) (cores) / 2 (NUMA pairs) / 2 (chips) / 1 (socket). A Ryzen 7 chip is 2/4/2/1/1, Ryzen 5 is 2/(3 or 2)/2/1/1, and Ryzen 3 is 1/2/2/1/1. Epyc comes in at 2/(3 or 4)/2/4/2, among other flavors.
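For what it's worth, here is a sketch of one way a program might carry that condensed description around. The package and component names are my own invention, not an existing library; the Ryzen_7 constant is just the 2/4/2/1/1 line restated.

   package CPU_Topology is
      type Description is record
         Threads_Per_Core  : Positive := 1;  --  SMT / Hyperthreading width
         Cores_Per_Cluster : Positive := 1;  --  cores sharing one L3 (a CCX)
         Clusters_Per_Chip : Positive := 1;  --  the "NUMA pairs" above
         Chips_Per_Socket  : Positive := 1;
         Sockets           : Positive := 1;
      end record;

      --  Ryzen 7 in the 2/4/2/1/1 notation used above.
      Ryzen_7 : constant Description :=
        (Threads_Per_Core  => 2,
         Cores_Per_Cluster => 4,
         Clusters_Per_Chip => 2,
         Chips_Per_Socket  => 1,
         Sockets           => 1);

      function Hardware_Threads (D : Description) return Positive is
        (D.Threads_Per_Core * D.Cores_Per_Cluster * D.Clusters_Per_Chip *
         D.Chips_Per_Socket * D.Sockets);
   end CPU_Topology;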
Writing a package to recognize these models and sort this out for executable programs is going to be a non-trivial exercise--at least if I try to keep it current. But how to convey the information to the software which is going to try to saturate the system? There is no point in creating tasks which won't have a CPU core of their own (or half a core, or whatever you count Hyperthreading as). Given the size of some of these systems, even without the HPC environment, it may be better for a program to split the work between chips or boxes.

Is adding these features to Ada worth the effort? Sure. Let me give you a very realistic example. Running on processor cores which share L3 cache may be worthwhile. Actually with Zen, the difference is that a program that stays on one L3 cache will save a lot of time on L2 probes. (The line you need is in L2 on another CPU core. Moving it to your core will take less time, and more importantly less latency, than moving it from another CPU cluster.) So we go to write our code. On Ryzen 7 we want to run on cores 1, 3, 5, and 7, or 9, 11, 13, and 15, or 2, 4, 6, 8, or... Actually I could choose 1, 4, 6, and 7: any set of one core from each pair, staying within the module (of eight threads). Move to a low-end Ryzen 3, and I get almost the same performance by choosing all the available cores: 1, 2, 3, and 4. What about Ryzen 5 1600 and 1600X? Is it going to be better to run on 3 cores and one L3 cache, or 4 cores spread across two caches? Or maybe choose all six cores on one L3 cache? Argh!

Is this problem real? I just took a program from 7.028 seconds on six cores to 2.229 seconds on (the correct) three cores. I'll post the program, or put it on-line somewhere, once I've confined the memory corruption to very small examples--so you can see which machines do it--and do a bit more cleanup and optimization.
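Going back to the cache-line problem above, here is the kind of declaration I would like to be able to write portably. It is only a sketch: there is no Cache_Line_Size in System today, so the 64-storage-unit figure is hard-wired (and, as noted, a compiler is free to reject an Alignment that large); it also assumes Long_Float takes 8 storage units.

   with System.Multiprocessors; use System.Multiprocessors;

   procedure Chunked is
      Cache_Line : constant := 64;  --  a guess; right for current x86 parts

      type Chunk is record
         Partial_Sum : Long_Float := 0.0;
         Padding     : String (1 .. Cache_Line - 8) := (others => ' ');
      end record
        with Alignment => Cache_Line;
      --  Each element is padded out to a full cache line, so a task that
      --  writes only its own element never shares a line with another task.

      type Chunk_Array is array (CPU range 1 .. Number_Of_CPUs) of Chunk
        with Independent_Components;

      Results : Chunk_Array;
   begin
      null;  --  worker task I would write only Results (I)
   end Chunked;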