From: "Robert I. Eachus"
Newsgroups: comp.lang.ada
Subject: Re: Large number of tasks slows down my program (using debian) - any fix?
Date: Sat, 7 Apr 2018 20:06:50 -0400

On 4/7/2018 12:28 PM, Brad Moore wrote:
> Then I thought, why even have 4 workers. Why not just one? When I set the number of Ada tasks to 1, then there is even more improvement, the code completes in 1.6 seconds. With just 1 worker, why even have a protected object?

You are getting into the area of chip specific optimizations. If you ran this on an AMD Zen chip, carefully assigning processor preferences, two would almost certainly be fastest. Would 1 be better than 3? No clue. But three through eight should be about the same, then a drop, as you went out toward sixteen. Use a six-core (12 thread) Zen and substitute 6 and 12 for 8 and 16 above. Mobile Zen has different characteristics.

With Intel chips, you need to know whether the chip has Hyperthreading, and whether it is enabled. You also need to know how many cores are present, and the sizes of the caches.

What is going on? When a processor core (or thread) runs, it needs the token in its L1 data cache. The cache line, either 64 or 128 bytes on most modern processors, is larger than the Packet being passed around. In addition, the move involves two cores, and may require ejecting a line from the target cache. In other words, that is where most of your processing time goes in this toy program, and in much bigger programs if you aren't careful.

Why can AMD Zen CPUs and Intel CPUs with Hyperthreading do better with two threads than one? You arrange for the two logical processors to be on the same physical processor. So the caches are shared, and no cache move is required.

If this were a real problem, and you needed days or months of CPU time, you would optimize each thread for the cache space available, break things up into independent threads or threads that run well together, and then assign them to the appropriate logical processors.
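
Assigning tasks to logical processors can be sketched in Ada 2012 with the CPU aspect from System.Multiprocessors. The following is only an illustration, not the program from this thread: the names Pin_Pair, Side, Token and Worker are hypothetical, and the choice of CPUs 1 and 2 as the two logical processors of one physical core is an assumption that has to be checked on each machine. On Linux, /sys/devices/system/cpu/cpu0/topology/thread_siblings_list shows which OS CPUs share a core with cpu0. Ada CPU numbers start at 1, and GNAT on Linux should map Ada CPU n to OS CPU n-1, but verify that for your compiler.

```ada
with Ada.Text_IO;
with Ada.Real_Time;          use Ada.Real_Time;
with System.Multiprocessors; use System.Multiprocessors;

procedure Pin_Pair is

   --  ASSUMPTION: these two Ada CPU numbers are the two logical
   --  processors of one physical core. Edit them for your machine.
   --  A CPU number above Number_Of_CPUs makes task creation fail.
   First_Sibling  : constant CPU := 1;
   Second_Sibling : constant CPU := 2;

   Rounds : constant := 1_000_000;   --  hand-offs performed by each task

   type Side is (Ping, Pong);

   --  A token that the two tasks hand back and forth. Each hand-off
   --  needs the token's cache line in the running core's cache.
   protected Token is
      entry Take (Side);             --  entry family, one entry per side
      function Passes return Natural;
   private
      Holder : Side    := Ping;
      Count  : Natural := 0;
   end Token;

   protected body Token is
      --  A side may take the token only when it is that side's turn.
      entry Take (for S in Side) when Holder = S is
      begin
         Count  := Count + 1;
         Holder := (if S = Ping then Pong else Ping);
      end Take;

      function Passes return Natural is
      begin
         return Count;
      end Passes;
   end Token;

   --  The CPU aspect pins each task object to the CPU given by its
   --  discriminant.
   task type Worker (Me : Side; Core : CPU) with CPU => Core;

   task body Worker is
   begin
      for I in 1 .. Rounds loop
         Token.Take (Me);
      end loop;
   end Worker;

   Start : constant Time := Clock;

begin
   declare
      A : Worker (Ping, First_Sibling);
      B : Worker (Pong, Second_Sibling);
   begin
      null;   --  leaving this block waits for both workers to finish
   end;

   Ada.Text_IO.Put_Line ("Passes:" & Natural'Image (Token.Passes));
   Ada.Text_IO.Put_Line
     ("Elapsed:" & Duration'Image (To_Duration (Clock - Start)) & " s");
end Pin_Pair;
```

To see the effect described above, run it once with the two constants set to sibling logical processors and once with them set to logical processors on different physical cores, then compare the elapsed times. The numbers will vary by chip, so treat any single run as a rough indication only.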