comp.lang.ada
* Poor performance after upgrade to Xubuntu 17.10
@ 2017-10-21 10:41 Charly
  2017-10-21 19:58 ` Chris M Moore
  2017-10-22 22:04 ` Robert Eachus
  0 siblings, 2 replies; 7+ messages in thread
From: Charly @ 2017-10-21 10:41 UTC (permalink / raw)


Hi,

Some months ago I started a thread about performance with the new gnat-gpl-2017 compiler and got some useful help, so I will try it again.

When I upgrade to new software/hardware I use a little performance test program:
an Ada program that solves Rubik's Tangle (https://www.jaapsch.net/puzzles/tangle.htm)
using a requested number of tasks, or no tasks at all.
The source code can be found here: https://github.com/CharlyGH/tangle.

Today I upgraded from Xubuntu 17.04 to 17.10 and got the following problem:

Let Tn be the runtime using n tasks, and T0 the runtime with no tasks.
For all previous versions I got the expected result
Tn = T0/n for all 1 <= n <= min(number of cores, 100).
The limit of 100 occurs because each task starts with a different
tile/orientation: there are 25 tiles and 4 orientations, so 100
combinations to start from. But this limit lies in the distant future :-).

Now after switching to xubuntu 17.10 I got the following strange results:


$ for n in  0 1 2 4 8 ; do ./tangle -t $n; echo "----------------"; done
using:  0 tasks
duration 2569 ms
----------------
using:  1 tasks
duration 2571 ms
----------------
using:  2 tasks
duration 2229 ms
----------------
using:  4 tasks
duration 3101 ms
----------------
using:  8 tasks
duration 2545 ms
----------------
$ 

The time is almost constant, with a maximum at n = 4.
This strange result, including the maximum at 4, is reproducible.

The values for n = 0 and n = 1 are almost the same as under the previous version of Xubuntu.

When the program is running, n cores are busy at 100 %, as expected.

Booting the old Linux Kernel 4.10 had no effect. 

My hardware:
AMD FX(tm)-8350 Eight-Core Processor


Sincerely
Charly


* Re: Poor performance after upgrade to Xubuntu 17.10
  2017-10-21 10:41 Poor performance after upgrade to Xubuntu 17.10 Charly
@ 2017-10-21 19:58 ` Chris M Moore
  2017-10-22 20:31   ` Charly
  2017-10-22 22:04 ` Robert Eachus
  1 sibling, 1 reply; 7+ messages in thread
From: Chris M Moore @ 2017-10-21 19:58 UTC (permalink / raw)


On 21/10/2017 11:41, Charly wrote:
> Hi,
> 
> Some months ago I started a thread about performance with the new gnat-gpl-2017 compiler and got some useful help, so I will try it again.
> 
> When I upgrade to new software/hardware I use a little performance test program:
> an Ada program that solves Rubik's Tangle (https://www.jaapsch.net/puzzles/tangle.htm)
> using a requested number of tasks, or no tasks at all.
> The source code can be found here: https://github.com/CharlyGH/tangle.
> 
> Today I upgraded from Xubuntu 17.04 to 17.10 and got the following problem:
> 
> Let Tn be the runtime using n tasks, and T0 the runtime with no tasks.
> For all previous versions I got the expected result
> Tn = T0/n for all 1 <= n <= min(number of cores, 100).
> The limit of 100 occurs because each task starts with a different
> tile/orientation: there are 25 tiles and 4 orientations, so 100
> combinations to start from. But this limit lies in the distant future :-).
> 
> Now after switching to xubuntu 17.10 I got the following strange results:
> 
> 
> $ for n in  0 1 2 4 8 ; do ./tangle -t $n; echo "----------------"; done
> using:  0 tasks
> duration 2569 ms
> ----------------
> using:  1 tasks
> duration 2571 ms
> ----------------
> using:  2 tasks
> duration 2229 ms
> ----------------
> using:  4 tasks
> duration 3101 ms
> ----------------
> using:  8 tasks
> duration 2545 ms
> ----------------
> $
> 
> The time is almost constant, with a maximum at n = 4.
> This strange result, including the maximum at 4, is reproducible.
> 
> The values for n = 0 and n = 1 are almost the same as under the previous version of Xubuntu.
> 
> When the program is running, n cores are busy at 100 %, as expected.
> 
> Booting the old Linux Kernel 4.10 had no effect.
> 
> My hardware:
> AMD FX(tm)-8350 Eight-Core Processor
> 
> 
> Sincerely
> Charly
> 

I had a little look at your code.  The main part is

       declare
          Worker : Ta_Parallel.Processes
            (1 .. Ta_Types_Pkg.Proc_Id_Type (Task_Count));
       begin
          Ta_Parallel.Initialize (Verbose_Level);

          for Idx in Worker'Range loop
             Worker (Idx).Start (Idx);
          end loop;

          for Idx in Worker'Range loop
             Worker (Idx).Wait (Res);
          end loop;
       end;

The problem here is the Wait line.  After you start the tasks you then
wait for each one to complete *in order*.  So you are at the mercy of
the scheduling algorithm in how it schedules the first task.  Maybe when
N=4 it starts 1, 2, 3, 4 but completes 4, 3, 2, 1!

When you exit the block, all tasks will have completed, so you don't
really need the Wait.
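
For example (a minimal sketch against your declarations; without the
Wait entry you would need some other way to collect Res, e.g. a
protected object):

       declare
          Worker : Ta_Parallel.Processes
            (1 .. Ta_Types_Pkg.Proc_Id_Type (Task_Count));
       begin
          Ta_Parallel.Initialize (Verbose_Level);

          for Idx in Worker'Range loop
             Worker (Idx).Start (Idx);
          end loop;

          --  No Wait loop needed as a join: Ada will not leave this
          --  block until every Worker task has terminated.
       end;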

Chris

-- 
sig pending (since 1995)



* Re: Poor performance after upgrade to Xubuntu 17.10
  2017-10-21 19:58 ` Chris M Moore
@ 2017-10-22 20:31   ` Charly
  0 siblings, 0 replies; 7+ messages in thread
From: Charly @ 2017-10-22 20:31 UTC (permalink / raw)


> 
> I had a little look at your code.  The main part is
> 
>        declare
>           Worker : Ta_Parallel.Processes
>             (1 .. Ta_Types_Pkg.Proc_Id_Type (Task_Count));
>        begin
>           Ta_Parallel.Initialize (Verbose_Level);
>
>           for Idx in Worker'Range loop
>              Worker (Idx).Start (Idx);
>           end loop;
>
>           for Idx in Worker'Range loop
>              Worker (Idx).Wait (Res);
>           end loop;
>        end;
> 
> The problem here is the Wait line.  After you start the tasks you then
> wait for each one to complete *in order*.  So you are at the mercy of
> the scheduling algorithm in how it schedules the first task.  Maybe when
> N=4 it starts 1, 2, 3, 4 but completes 4, 3, 2, 1!
>
> When you exit the block, all tasks will have completed, so you don't
> really need the Wait.
> 
> Chris
> 
> -- 
> sig pending (since 1995)

Hi Chris,

I tried your suggestion, but it didn't have any noticeable effect.

Charly



* Re: Poor performance after upgrade to Xubuntu 17.10
  2017-10-21 10:41 Poor performance after upgrade to Xubuntu 17.10 Charly
  2017-10-21 19:58 ` Chris M Moore
@ 2017-10-22 22:04 ` Robert Eachus
  2017-10-23  6:11   ` Luke A. Guest
  2017-10-25 18:56   ` Charly
  1 sibling, 2 replies; 7+ messages in thread
From: Robert Eachus @ 2017-10-22 22:04 UTC (permalink / raw)


On Saturday, October 21, 2017 at 6:41:59 AM UTC-4, Charly wrote: 
> My Hardware
> AMD FX(tm)-8350 Eight-Core Processor

Oh boy! Welcome to the wonderful world of modern tasking.  Intel chips with Hyperthreading and the new AMD Ryzens are different, but the issues come out the same: sometimes not all cores can be treated equally.

The 8350 has four modules with two cores each.  Each core has its own L1 instruction and data cache.  It shares a 2 MByte L2 cache with its partner in the module, and there is an 8 MByte L3 cache.  I assume your program is small enough that the compute tasks' instructions and data fit into the L1 caches.

If you are using any floating point instructions or registers, that opens up more potential problems.  I have some compute cores that work best on Bulldozer family AMD chips and Intel chips with Hyperthreading by using every other CPU number: 0,2,4,6 in your case.  But I don't think this code runs into that.

So far, so good.  But it looks like you are getting tripped up by one or more data cache lines being shared between compute engines. (Instruction cache lines?  Sharing is fine.) It could be an actual value shared among tasks, or several different values that get allocated in close proximity.  I hope, and count on, task stacks not being adjacent, so this usually happens for (shared) variables in the parent of the tasks, or variables in the spec of generic library packages.

If this happens, the cache management will result in just what you are seeing.  Owning that cache line will act like a ring token passed from task to task. Parallel and Ta_Types are the two packages I'd be suspicious of. The detail here that may be biting you is that the variables in these packages are on the main stack, not duplicated, if necessary, in each task stack.

Eventually you get to the point of paranoia where you make anything that goes on the main stack a multiple of 64 or 128 bytes and ensure that the compiler follows your intent. You also have the worker tasks copy as constants any main program variables that are, in their view, constants.
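
In GNAT-flavored Ada, that padding can look like this (a sketch only:
the 64-byte line size is assumed, and Worker_State is a made-up
stand-in for whatever per-task data the packages really hold):

   Cache_Line : constant := 64;  --  assumed cache line size in bytes

   --  Pad and align each worker's data so that two workers can
   --  never share a cache line.
   type Worker_State is record
      Count : Long_Integer := 0;  --  8 bytes on 64-bit GNAT
      Pad   : String (1 .. Cache_Line - 8) := (others => ' ');
   end record
     with Alignment => Cache_Line;

   States : array (1 .. 8) of Worker_State;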

Finally, just good task programming.  If you expect to have each task on its own CPU core or thread, use affinities to tie them to specific cores.  Why?  Modern processors do not flush all caches when an interrupt is serviced, so after an interrupt you want the same task back on the CPU or thread whose caches still hold its data. (In fact, some CPUs go further, and have ownership tags on the cache lines.  So some data in cache can belong to the OS, and the rest of it to your task.)

Note that when setting affinities, CPU 0 becomes affinity 1, etc. For each thread there is a bit vector of threads it can run on.  On Windows, the argument is a hex number that converts to a bit vector.  On a Hyperthreaded or Zen CPU, affinity 3 means run on either thread on CPU 0.  In your case 3 would mean run on either of the processors in module 0, and so on.  Setting affinity to 0 is not a good idea.
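
As a worked example of such a bit vector (illustrative values, not from
the tangle code): choosing one core per FX-8350 module, i.e. CPUs 0, 2,
4 and 6, means setting bits 0, 2, 4 and 6 of the mask:

   with Interfaces; use Interfaces;

   package Affinity_Demo is
      --  Bit N set means "may run on CPU N".  Setting bits
      --  0, 2, 4 and 6 gives 2#0101_0101# = 16#55# = 85.
      Every_Other_Core : constant Unsigned_32 := 2#0101_0101#;
   end Affinity_Demo;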
 
By the way, is the duplicate value 'X' in the declaration of Ta_Types.Chip_Name intentional? Certainly worth a comment if it is.


* Re: Poor performance after upgrade to Xubuntu 17.10
  2017-10-22 22:04 ` Robert Eachus
@ 2017-10-23  6:11   ` Luke A. Guest
  2017-10-23  8:00     ` Mark Lorenzen
  2017-10-25 18:56   ` Charly
  1 sibling, 1 reply; 7+ messages in thread
From: Luke A. Guest @ 2017-10-23  6:11 UTC (permalink / raw)


Robert Eachus <rieachus@comcast.net> wrote:
> 
> Finally, just good task programming.  If you expect to have each task on
> its own CPU core or thread, use affinities to tie them to specific cores.  

How do you do this in Ada?



* Re: Poor performance after upgrade to Xubuntu 17.10
  2017-10-23  6:11   ` Luke A. Guest
@ 2017-10-23  8:00     ` Mark Lorenzen
  0 siblings, 0 replies; 7+ messages in thread
From: Mark Lorenzen @ 2017-10-23  8:00 UTC (permalink / raw)


On Monday, October 23, 2017 at 8:11:25 AM UTC+2, Luke A. Guest wrote:
> Robert Eachus <rieachus@comcast.net> wrote:
> > 
> > Finally, just good task programming.  If you expect to have each task on
> > its own CPU core or thread, use affinities to tie them to specific cores.  
> 
> How do you do this in Ada?


Aspect/pragma "CPU".
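
For example (a minimal sketch; the names Pin_Demo, Worker and Core are
illustrative):

   with System.Multiprocessors; use System.Multiprocessors;

   procedure Pin_Demo is
      --  Each task object is pinned to the CPU named by its
      --  discriminant.  CPU numbering starts at 1 here.
      task type Worker (Core : CPU) with CPU => Core;

      task body Worker is
      begin
         null;  --  compute kernel goes here
      end Worker;

      W1 : Worker (Core => 1);
      W2 : Worker (Core => 2);
   begin
      null;  --  leaving Pin_Demo waits for W1 and W2
   end Pin_Demo;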

Regards,

Mark L



* Re: Poor performance after upgrade to Xubuntu 17.10
  2017-10-22 22:04 ` Robert Eachus
  2017-10-23  6:11   ` Luke A. Guest
@ 2017-10-25 18:56   ` Charly
  1 sibling, 0 replies; 7+ messages in thread
From: Charly @ 2017-10-25 18:56 UTC (permalink / raw)


On Monday, October 23, 2017 at 00:04:19 UTC+2, Robert Eachus wrote:
> On Saturday, October 21, 2017 at 6:41:59 AM UTC-4, Charly wrote: 
> > My Hardware
> > AMD FX(tm)-8350 Eight-Core Processor
> 
> Oh boy! Welcome to the wonderful world of modern tasking.  Intel chips with Hyperthreading and the new AMD Ryzens are different, but the issues come out the same: sometimes not all cores can be treated equally.
> 
> The 8350 has four modules with two cores each.  Each core has its own L1 instruction and data cache.  It shares a 2 MByte L2 cache with its partner in the module, and there is an 8 MByte L3 cache.  I assume your program is small enough that the compute tasks' instructions and data fit into the L1 caches.
> 
> If you are using any floating point instructions or registers, that opens up more potential problems.  I have some compute cores that work best on Bulldozer family AMD chips and Intel chips with Hyperthreading by using every other CPU number: 0,2,4,6 in your case.  But I don't think this code runs into that.
> 
> So far, so good.  But it looks like you are getting tripped up by one or more data cache lines being shared between compute engines. (Instruction cache lines?  Sharing is fine.) It could be an actual value shared among tasks, or several different values that get allocated in close proximity.  I hope, and count on, task stacks not being adjacent, so this usually happens for (shared) variables in the parent of the tasks, or variables in the spec of generic library packages.
> 
> If this happens, the cache management will result in just what you are seeing.  Owning that cache line will act like a ring token passed from task to task. Parallel and Ta_Types are the two packages I'd be suspicious of. The detail here that may be biting you is that the variables in these packages are on the main stack, not duplicated, if necessary, in each task stack.
> 
> Eventually you get to the point of paranoia where you make anything that goes on the main stack a multiple of 64 or 128 bytes and ensure that the compiler follows your intent. You also have the worker tasks copy as constants any main program variables that are, in their view, constants.
> 
> Finally, just good task programming.  If you expect to have each task on its own CPU core or thread, use affinities to tie them to specific cores.  Why?  Modern processors do not flush all caches when an interrupt is serviced, so after an interrupt you want the same task back on the CPU or thread whose caches still hold its data. (In fact, some CPUs go further, and have ownership tags on the cache lines.  So some data in cache can belong to the OS, and the rest of it to your task.)
> 
> Note that when setting affinities, CPU 0 becomes affinity 1, etc. For each thread there is a bit vector of threads it can run on.  On Windows, the argument is a hex number that converts to a bit vector.  On a Hyperthreaded or Zen CPU, affinity 3 means run on either thread on CPU 0.  In your case 3 would mean run on either of the processors in module 0, and so on.  Setting affinity to 0 is not a good idea.
>  
> By the way, is the duplicate value 'X' in the declaration of Ta_Types.Chip_Name intentional? Certainly worth a comment if it is.


Hi,

Thank you for your elaborate answer, but I still don't see why I got this
strange behaviour after I upgraded to the new version.

For the previous eight Ubuntu versions that I used on this hardware, the
runtime decreased and the total CPU usage stayed constant when I increased
the number of tasks.  But now the runtime is almost constant and the total
CPU usage increases with the number of tasks.

I assume it's caused by the interaction of GNAT and the new glibc/libpthread libraries.

By the way, the two 'X's in Chip_Name are not a typo.
There are 4! = 24 permutations of the colors red, green, blue and yellow,
but 25 tiles, so one permutation must occur twice.

Sincerely
Charly


end of thread

Thread overview: 7+ messages
2017-10-21 10:41 Poor performance after upgrade to Xubuntu 17.10 Charly
2017-10-21 19:58 ` Chris M Moore
2017-10-22 20:31   ` Charly
2017-10-22 22:04 ` Robert Eachus
2017-10-23  6:11   ` Luke A. Guest
2017-10-23  8:00     ` Mark Lorenzen
2017-10-25 18:56   ` Charly
