From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM autolearn=unavailable autolearn_force=no version=3.4.4 X-Received: by 10.13.255.194 with SMTP id p185mr8426138ywf.205.1501645441518; Tue, 01 Aug 2017 20:44:01 -0700 (PDT) X-Received: by 10.36.80.193 with SMTP id m184mr172241itb.6.1501645441473; Tue, 01 Aug 2017 20:44:01 -0700 (PDT) Path: eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!news.eternal-september.org!news.eternal-september.org!feeder.eternal-september.org!news.glorb.com!w51no4611qtc.0!news-out.google.com!196ni1828itl.0!nntp.google.com!u14no392163ita.0!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail Newsgroups: comp.lang.ada Date: Tue, 1 Aug 2017 20:44:01 -0700 (PDT) In-Reply-To: Complaints-To: groups-abuse@google.com Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=2601:191:8303:2100:ad21:fae8:74f1:4499; posting-account=fdRd8woAAADTIlxCu9FgvDrUK4wPzvy3 NNTP-Posting-Host: 2601:191:8303:2100:ad21:fae8:74f1:4499 References: <9e51f87c-3b54-4d09-b9ca-e3c6a6e8940a@googlegroups.com> User-Agent: G2/1.0 MIME-Version: 1.0 Message-ID: Subject: Re: Real tasking problems with Ada. From: Robert Eachus Injection-Date: Wed, 02 Aug 2017 03:44:01 +0000 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Xref: news.eternal-september.org comp.lang.ada:47553 Date: 2017-08-01T20:44:01-07:00 List-Id: On Tuesday, August 1, 2017 at 12:42:01 AM UTC-4, Randy Brukardt wrote: =20 > Use a discriminant of type CPU and (in Ada 2020) an=20 > iterated_component_association. (This was suggested way back in Ada 9x, l= eft=20 > out in the infamous "scope reduction", and then forgotten about until 201= 3.=20 > See AI12-0061-1 or the draft RM=20 > http://www.ada-auth.org/standards/2xaarm/html/AA-4-3-3.html). It is nice that this is in there, just in-time to be outmoded. See my prev= ious post, on most desktop and server processors assigning to alternate pro= cessor IDs is necessary. > Why do you say this? Ada doesn't require the task that calls a protected= =20 > object to execute it (the execution can be handdled by the task that=20 > services the barriers - I don't know if any implementation actually does= =20 > this, but the language rules are written to allow it). Um, I say it because any other result is useless? The use case is for the = called task or protected object to be able to get the CPU number and do som= ething with it. The simple case right now would be a protected object whos= e only purpose is to make sure task assignments are compatible with the arc= hitecture. Again with Zen, Epyx has multiprocessors with 64 cores, 128 thr= eads that will show as CPUs. There are favored pairings of threads so that= they share caches, or have shorter paths to parts of L3. Intel says they = will support up to eight sockets, with 28 CPU cores and 56 threads per sock= et. I don't believe a system with 448 CPU threads will be realistic, even = 224 will probably stress scheduling. (Note that many of these "big" system= s use VMware to create lots of virtual machines with four or eight threads.= ) > Say what? You're using Get_Id, so clearly you're using something there.= =20 > Get_Id (like the rest of dispatching domains is likely to be expensive, s= o=20 > you don't want it dragged into all programs. (And CPU is effectively part= of=20 > all programs.) Sigh! Get_Id as defined is heavy only because of the default initial value: function Get_CPU (T : Ada.Task_Identification.Task_Id :=3D Ada.Task_Identification.Current_Task) return CPU_Range; but a parameterless Get_CPU could compile to a single load instruction. I'= m not asking for the current function to be eliminated, there are situation= s where it is needed. But it doesn't need all that baggage for the normal = use. >=20 > > 2) Allow a task to its CPU assignment after it has started execution. = It > > is no big deal if a task starts on a different CPU than the one it will= =20 > > spend > > the rest of its life on. At a minimum Set_CPU(Current_CPU) or just > > Set_CPU should cause the task to be anchored to its current CPU core. > > Note that again you can do this with Dispatching_Domains. Left out two words above. Should read "Allow a task to statically set its = CPU assignment... >=20 > So the capability already exists, but you don't like having to with an ex= tra=20 > package to use it? Have you lost your freaking mind? You want us to add= =20 > operations that ALREADY EXIST to another package, with all of the=20 > compatibility problems that doing so would cause (especially for people t= hat=20 > had withed and used Dispatching_Domains)? When there are lots of problems= =20 > that can't be solved portably at all? No, I don't want compilers putting in extra code when it is not necessary. = If a task has a (static) CPU assignment then again, Get_CPU is essentially= free. Is it cheaper than fishing some index out of main memory because it= got flushed there? Probably. Yes, I can make a function My_CPU which does= this. I'm just making too many of those workarounds right now.=20 >=20 > No, you're missing the point. Ada is about writing portable code. Nothing= at=20 > the level of "cache lines" is EVER going to be portable in any way. Eithe= r=20 > one writes "safe" code and hopefully the compiler and runtime can take in= to=20 > account the characteristics of the target. (Perhaps parallel loop constru= cts=20 > will help with that.) >=20 > Or otherwise, one writes bleeding edge code that is not portable and not= =20 > safe. And you're on your own in such a case; no programming language coul= d=20 > possibly help you. I am trying to write portable code. Portable enough to run on all modern A= MD64 and EM64T (Intel) CPUs. A table of processor IDs with the associated = values for the numbers I need in the program is not going to happen. Most = of the parameters I need can be found in Ada now. Cache line size is one t= hat is not there. >=20 > >A function Cache_Line_Size in System or System.Multiprocessors seems rig= ht. >=20 > No, it doesn't. It assumes a particular memory organization, and one thi= ng=20 > that's pretty clear is that whatever memory organization is common now wi= ll=20 > not be common in a bunch of years. Besides, so many systems have multiple= =20 > layers of caches, that a single result won't be enough. And there is no w= ay=20 > for a general implementation to find this out (neither CPUs nor kernels= =20 > describe such information). Um. No. Systems may have multiple levels of caches of different sizes and = different numbers of "ways" per cache. But the actual cache line size is a= lmost locked in, and is the same for all caches in a system. Most systems = with DDR3 and DDR4 use 64 byte cache lines because it matches the memory bu= rst length. But other values are possible. Right now HBM2 is pushing GPUs= (not CPUs) to 256 byte cache lines. Will we eventually have Ada compilers= generating code for heterogeneous systems? Possible. What I am working o= n is building the blocks that can be used with DirectCompute, OpenCL 2.0, a= nd perhaps other GPU software interfaces.=20 >=20 > >Is adding these features to Ada worth the effort? >=20 > No way. They're much too low level, and they actually aren't enough to al= low=20 > parallelization. You want a language which allows fine-grained parallelis= m=20 > from the start (like Parasail); trying to retrofit that on Ada (which is= =20 > mainly sequential, only having coarse parallelism) just will make a mess.= =20 > You might get a few problems solved (those using actual arrays, as oppose= d=20 > to containers or user-defined types -- which one hopes are far more commo= n=20 > in today's programs), but there is nothing general, nor anything that fit= s=20 > into Ada's building block approach, at the level that you're discussing. >=20 For now we can agree to disagree. The difference is the size of the arrays= we have to deal with. When arrays get to tens of millions of entries, and= operations on them can take tens of billions operations, I don't think I a= m talking about fine grained parallelism. The main characteristics of the = operations I want to get working: matrix multiplication and inversion, line= ar programming, FFT, FLIR and FLRadar, Navier-Stokes, all have the form of = a set of huge data arrays constant once loaded, and can be parallelized acr= oss large numbers of CPU cores, or GPUs. =20