From: Brad Moore
Newsgroups: comp.lang.ada
Subject: Re: GNAT and Tasklets
Date: Fri, 19 Dec 2014 11:35:05 -0700

On 14-12-19 10:28 AM, Dmitry A. Kazakov wrote:
> On Fri, 19 Dec 2014 09:42:52 -0700, Brad Moore wrote:
>
>> On 14-12-19 04:01 AM, Dmitry A. Kazakov wrote:
>>> On Fri, 19 Dec 2014 10:40:03 +0000 (UTC), Georg Bauhaus wrote:
>>>
>>>> wrote:
>>>>
>>>>> It would be interesting to do a little survey on existing code using
>>>>> tasking. I have the impression that only tasks at Library level do
>>>>> rendez-vous and protected object synchronisation, and local tasks,
>>>>> most of the time, are limited to a rendez-vous with their parent task
>>>>> at the beginning or at the end. So maybe we should put restrictions
>>>>> on local tasks, so that we can map them to jobs.
>>>>
>>>> Won't the parallel loop feature be providing
>>>> for this kind of mini job:
>>>
>>> Parallel loop is useless for practical purposes. It wonders me why
>>> people wasting time with this.
>>
>> For multicore, the idea is to make better use of the cores when doing
>> so will improve performance.
>
> I don't think multi-core would bring any advantage. Starting /
> activating / reusing / feeding / re-synchronizing threads is too
> expensive.

Yes, there is overhead, but if the amount of overhead is small relative
to the work being performed, and the work can be divided up between
multiple cores, then there obviously is a performance advantage.

> Parallel loops could be useful on some massively vectorized
> architectures in some very specialized form, or on architectures with
> practically infinite number of cores (e.g. molecular computers). Anyway
> feeding threads with inputs and gathering outputs may still mean more
> overhead than any gain.
>
>> Also, the number of iterations does not need to be large to see
>> parallelism benefits.
>>
>>    for I in parallel 1 .. 10 loop
>>       Lengthy_Processing_Of_Image (I);
>>    end loop;
>
> Certainly not for image processing. In image processing when doing it
> by segments, you need to sew the segments along their borders,
> practically in all algorithms. That makes parallelization far more
> complicated to be handled by such a blunt thing as parallel loop.

If you have special hardware, such as vectorization support, that is
another form of parallelism. Not all platforms have such hardware
support, though. A compiler might decide to take advantage of both
hardware and software parallelism for a particular problem.

The example here was meant to be a trivial one, showing manipulation of
a large data structure such as a bit array. Another example might be
solving a large matrix of differential equations.
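To make the division of work concrete, here is a rough sketch of how the
iterations of a loop like the quoted one could be split across worker
tasks in today's Ada. The two-way split, the task names, and the body of
Lengthy_Processing_Of_Image are placeholders for illustration, not what a
compiler would necessarily generate.

with Ada.Text_IO;

procedure Parallel_Loop_Sketch is

   --  Stand-in for the per-iteration work from the example above
   procedure Lengthy_Processing_Of_Image (I : Positive) is
   begin
      Ada.Text_IO.Put_Line ("Processing segment" & Positive'Image (I));
   end Lengthy_Processing_Of_Image;

   --  Each worker handles a contiguous chunk of the iteration range
   task type Worker (First, Last : Positive);

   task body Worker is
   begin
      for I in First .. Last loop
         Lengthy_Processing_Of_Image (I);
      end loop;
   end Worker;

begin
   declare
      --  Two workers, each taking half of 1 .. 10.  A real scheme would
      --  size the chunks to the number of available cores.
      Lower_Half : Worker (First => 1, Last => 5);
      Upper_Half : Worker (First => 6, Last => 10);
   begin
      null;  --  the block does not exit until both workers terminate
   end;
end Parallel_Loop_Sketch;

The point is only that once the overhead of creating and joining the
workers is small compared to the work in each chunk, the extra cores pay
off.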
>
>>> They could start with logical operations instead:
>>>
>>>    X and Y
>>>
>>> is already parallel by default. AFAIK nothing in RM forbids concurrent
>>> evaluation of X and Y if they are independent. Same with Ada
>>> arithmetic. E.g.
>>>
>>>    A + B + C + D
>>>
>>> So far no compiler evaluates arguments concurrently or vectorizes
>>> sub-expressions like:
>>>
>>>    A
>>>    B +
>>>    C +
>>>    D +
>>>
>>> Because if they did the result would work slower than sequential code.
>>> It simply does not worth the efforts with existing machine
>>> architectures.
>>
>> The compiler should be able to make the decision to parallelize these
>> if there is any benefit to doing so. Likely the decision would be to
>> *not* parallelize these, if A, B, C, and D are objects of some
>> elementary type.
>>
>> But it depends on the datatype of A, B, C, and D.
>>
>> Also A, B, C, and D might be function calls, not simple data
>> references, and these calls might involve lengthy processing, in which
>> case, adding parallelism might make sense.
>
> Yes, provided the language has means to describe side effects of such
> computations in a way making the decision safe.
>
>> Or, if these are objects of a Big Number library with infinite
>> precision, you might have an irrational number with pages of digits
>> each for numerator and denominator. Performing math on such values
>> might very well benefit from parallelism.
>
> It won't, because a big number library will use the heap or a shared
> part of the stack which will require interlocking and thus will either
> be marked as "impure", so that the compiler will not try to parallelize,
> or else will make the compiler to use locks, which will effectively kill
> parallelism.

A while back I wrote an example tackling the problem solved by the first
computer program (the one written by Ada Lovelace), which is to generate
Bernoulli numbers. I originally wanted to use an algorithm that closely
matched the processing of Babbage's Analytical Engine, but found that
algorithm couldn't be parallelized, since the results of each iteration
depended on the results of previous iterations. However, I did find
another algorithm for generating Bernoulli numbers that could be
parallelized. For that implementation I wrote a Big Number library, and
found that the parallelism did indeed work and was beneficial.

At the time, I believe I was just using the default memory pool, which
would have had the interlocking you mention. If I were to try again, I
would use a pool such as the Deepend storage pool, where each worker
thread has its own local pool and thus no interlocking is needed. I would
expect to see even more performance gains using this approach.
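Coming back to the case mentioned above, where A, B, C, and D are lengthy
function calls: as a rough sketch of the effect one would want, here is
how two expensive operands could be evaluated concurrently with today's
Ada tasks. Costly_A and Costly_B are placeholder functions made up for
illustration; the point is that each operand gets its own task and the
parent only reads the results after both tasks have terminated.

with Ada.Text_IO;

procedure Concurrent_Operands_Sketch is

   --  Placeholders standing in for lengthy computations
   function Costly_A return Integer is
   begin
      return 2;
   end Costly_A;

   function Costly_B return Integer is
   begin
      return 3;
   end Costly_B;

   Left, Right : Integer := 0;
   Total       : Integer := 0;

begin
   declare
      task Evaluate_Left;
      task Evaluate_Right;

      task body Evaluate_Left is
      begin
         Left := Costly_A;   --  only this task writes Left
      end Evaluate_Left;

      task body Evaluate_Right is
      begin
         Right := Costly_B;  --  only this task writes Right
      end Evaluate_Right;
   begin
      null;  --  the block completes only after both tasks terminate
   end;

   --  Both operands are available here, so the sum is safe to compute
   Total := Left + Right;
   Ada.Text_IO.Put_Line ("Total =" & Integer'Image (Total));
end Concurrent_Operands_Sketch;

No locking is needed because each task writes a distinct variable and the
enclosing block acts as the join point.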
>
>> We looked at being able to explicitly state parallelism for
>> subprograms (parallel subprograms), but found that syntax was messy,
>> and there were too many other problems.
>>
>> We are currently thinking a parallel block syntax better provides this
>> capability, if the programmer wants to explicitly indicate where
>> parallelism is desired.
>>
>> eg.
>>
>>    Left, Right, Total : Integer := 0;
>>
>>    parallel
>>       Left := A + B;
>>    and
>>       Right := C + D;
>>    end parallel;
>>
>>    Total := Left + Right;
>>
>> or possibly allow some automatic reduction
>>
>>    Total : Integer with Reduce := 0;
>>
>>    parallel
>>       Total := A + B;
>>    and
>>       Total := C + D;
>>    end parallel;
>>
>> Here, two "tasklets" would get created that can execute in parallel,
>> each with their local instance of the Total result (i.e. thread local
>> storage), and at the end of the parallel block, the two results are
>> reduced into one and assigned to the actual Total.
>>
>> The reduction operation and identity value used to initialize the
>> local instances of the Total could be defaulted by the compiler for
>> simple data types, but could be explicitly stated if desired.
>>
>> eg.
>>
>>    Total : Integer with Reduce => (Reducer => "+", Identity => 0) := 0;
>>
>>    parallel
>>       Total := A + B;
>>    and
>>       Total := C + D;
>>    end parallel;
>
> The semantics is not clear. What happens if:
>
>    parallel
>       Total := Total + 1;
>       Total := A + B;
>    and
>       Total := C + D;
>    end parallel;
>
> and of course the question of exceptions raised within concurrent paths.
>

I haven't listed all the semantics here, but to answer your questions:
each arm of the parallel block is a separate thread of execution (which
we have been calling a tasklet). Each tasklet starts off with its own
local declaration of Total, initialized to 0, which is the Identity value
for the reduction.

So for the top arm you end up with

   {Top_}Total := 1 + A + B;

and for the bottom arm you end up with

   {Bottom_}Total := C + D;

Then, during the reduction phase, those two results get reduced using the
reduction operation, which in this case is "+". So the end result is

   Total = 1 + A + B + C + D;

As for exceptions, we are thinking that execution does not continue past
the parallel block until all branches have finished executing. If
exceptions are raised in multiple branches of the parallel block, then
only one of those exceptions would be selected, and only that one
exception would get propagated outside the block.

Brad
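P.S. For anyone who wants to experiment with the intended behaviour in
today's Ada, here is a minimal sketch that approximates the reduction
example above using two tasks and a protected object applying the "+"
reducer. It only illustrates the idea, not the proposed feature; the
names and the use of Integer are placeholders.

with Ada.Text_IO;

procedure Parallel_Block_Sketch is

   A, B, C, D : constant Integer := 1;  --  placeholder inputs

   --  Gathers each tasklet's local result using the "+" reducer
   protected Reducer is
      procedure Reduce (Partial : in Integer);
      function Result return Integer;
   private
      Total : Integer := 0;  --  the Identity value for "+"
   end Reducer;

   protected body Reducer is
      procedure Reduce (Partial : in Integer) is
      begin
         Total := Total + Partial;
      end Reduce;

      function Result return Integer is
      begin
         return Total;
      end Result;
   end Reducer;

begin
   declare
      --  One task per arm of the parallel block
      task Top_Arm;
      task Bottom_Arm;

      task body Top_Arm is
         Local_Total : Integer := 0;  --  tasklet-local instance of Total
      begin
         Local_Total := A + B;
         Reducer.Reduce (Local_Total);
      end Top_Arm;

      task body Bottom_Arm is
         Local_Total : Integer := 0;
      begin
         Local_Total := C + D;
         Reducer.Reduce (Local_Total);
      end Bottom_Arm;
   begin
      null;  --  execution does not continue until both arms have finished
   end;

   Ada.Text_IO.Put_Line ("Total =" & Integer'Image (Reducer.Result));
end Parallel_Block_Sketch;

The protected object is touched only once per arm, at the hand-off of the
local result, so the cost of the reduction stays small relative to the
work done in each arm.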