From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM
	autolearn=unavailable autolearn_force=no version=3.4.4
Path: 
 eternal-september.org!reader01.eternal-september.org!reader02.eternal-september.org!feeder.eternal-september.org!aioe.org!.POSTED!not-for-mail
From: "Robert I. Eachus" <rieachus@comcast.net>
Newsgroups: comp.lang.ada
Subject: Re: Large number of tasks slows down my program (using debian) - any
 fix?
Date: Sat, 7 Apr 2018 20:06:50 -0400
Organization: Aioe.org NNTP Server
Message-ID: <pabmep$1ej9$1@gioia.aioe.org>
References: <1aa8f536-250d-4bef-9392-4d936f916e5f@googlegroups.com>
 <9377f941-31d0-4260-818a-8e189aac8c19@googlegroups.com>
 <p9kuig$lq8$1@gioia.aioe.org>
 <f25f9607-82be-4eda-be3d-ade20724d610@googlegroups.com>
 <10e74e0c-119a-4d86-8a12-c05101f744f1@googlegroups.com>
 <pa84sd$trh$1@dont-email.me>
 <d546d4b7-31d1-4347-b283-96a5fc4e45dd@googlegroups.com>
 <pa9un7$b61$1@dont-email.me>
 <c41f508c-9b42-422c-9f58-f29c0f611416@googlegroups.com>
NNTP-Posting-Host: fZYVf2g/avGnWJvs1xVPEA.user.gioia.aioe.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Complaints-To: abuse@aioe.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101
 Thunderbird/52.5.2
X-Notice: Filtered by postfilter v. 0.8.3
Content-Language: en-US
Xref: reader02.eternal-september.org comp.lang.ada:51392
Date: 2018-04-07T20:06:50-04:00
List-Id: <comp.lang.ada>

On 4/7/2018 12:28 PM, Brad Moore wrote:
> Then I thought, why even have 4 workers. Why not just one? When I set the number of Ada tasks to 1, then there is even more improvement, the code completes in 1.6 seconds. With just 1 worker, why even have a protected object?

You are getting into the area of chip-specific optimizations.  If you
ran this on an AMD Zen chip, carefully assigning processor preferences,
two tasks would almost certainly be fastest.  Would 1 be better than 3?
No clue.  But three through eight should be about the same, then a drop
as you go out toward sixteen.  For a six-core (12 thread) Zen,
substitute 6 and 12 for 8 and 16 above.  Mobile Zen has different
characteristics.

With Intel chips, you need to know whether it has Hyperthreading, and 
whether it is enabled.  You also need to know how many cores are 
present--and the sizes of the caches.

What is going on?  Before a processor core (or hardware thread) can
work on the token, it needs the token in its L1 data cache.  The cache
line, either 64 or 128 bytes on most modern processors, is larger than
the Packet being passed around.  In addition, the move involves two
cores, and may require evicting a line from the target cache.  In other
words, that is where most of your processing time goes in this toy
program, and in much bigger programs if you aren't careful.

Why can AMD Zen CPUs and Intel CPUs with Hyperthreading do better with
two threads than one?  You arrange for the two logical processors to be
on the same physical core.  Then the caches are shared, and no cache
move is required.
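
A minimal sketch of that arrangement, using the Ada 2012 CPU aspect from
System.Multiprocessors.  Everything named here (Pinned_Pair, Token,
Player, Rounds) is made up for illustration and is not Brad's program.
The big assumption is that logical CPUs 1 and 2 are the two hardware
threads of one physical core; that numbering varies by system, so check
your own topology before trusting any timing you get from this.

```ada
with Ada.Text_IO;
with Ada.Real_Time;          use Ada.Real_Time;
with System.Multiprocessors; use System.Multiprocessors;

procedure Pinned_Pair is

   --  ASSUMPTION: logical CPUs 1 and 2 share one physical core.
   --  This is system-specific; verify it on your own machine.
   First_Sibling  : constant CPU_Range := 1;

   --  Fall back to "no specific CPU" on a single-CPU machine, since
   --  asking for a CPU that does not exist makes the task fail.
   Second_Sibling : constant CPU_Range :=
     (if Number_Of_CPUs >= 2 then 2 else Not_A_Specific_CPU);

   Rounds : constant := 100_000;   --  passes made by each task

   --  Hypothetical stand-in for the token handed between workers.
   protected Token is
      entry Pass (Boolean);        --  entry family, one per player
      function Passes return Natural;
   private
      Holder : Boolean := False;   --  whose turn it is
      Count  : Natural := 0;
   end Token;

   protected body Token is
      entry Pass (for Me in Boolean) when Holder = Me is
      begin
         Holder := not Me;         --  hand the token to the other task
         Count  := Count + 1;
      end Pass;

      function Passes return Natural is
      begin
         return Count;
      end Passes;
   end Token;

   --  Each task is pinned to the logical CPU given by its discriminant.
   task type Player (Me : Boolean; Where : CPU_Range)
     with CPU => Where;

   task body Player is
   begin
      for I in 1 .. Rounds loop
         Token.Pass (Me);
      end loop;
   end Player;

   Start : constant Time := Clock;

begin
   declare
      A : Player (Me => False, Where => First_Sibling);
      B : Player (Me => True,  Where => Second_Sibling);
   begin
      null;   --  leaving the block waits for both tasks to finish
   end;

   Ada.Text_IO.Put_Line ("Passes:" & Natural'Image (Token.Passes));
   Ada.Text_IO.Put_Line
     ("Seconds:" & Duration'Image (To_Duration (Clock - Start)));
end Pinned_Pair;
```

To see the effect, change Second_Sibling to a logical CPU on a different
physical core and compare the elapsed times.  On Linux,
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list shows which
logical CPUs share a core, but note that Linux numbers CPUs from 0 while
System.Multiprocessors numbers them from 1.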

If this were a real problem, and you needed days or months of CPU time,
you would optimize each thread for the cache space available, break
things up into independent threads, or threads that run well together,
and then assign them to the appropriate logical processors.