* Poor performance after upgrade to xubuntu 17.10
@ 2017-10-21 10:41 Charly
  2017-10-21 19:58 ` Chris M Moore
  2017-10-22 22:04 ` Robert Eachus
  0 siblings, 2 replies; 7+ messages in thread
From: Charly @ 2017-10-21 10:41 UTC (permalink / raw)

Hi,

some months ago I started a thread about performance of the new
gnat-gpl-2017 compiler and got some useful help, so I will try again.

When I upgrade to new software or hardware I use a little performance
test program that solves Rubik's Tangle
(https://www.jaapsch.net/puzzles/tangle.htm) with an Ada program, using
a requested number of tasks or no tasks at all.  The source code can be
found here: https://github.com/CharlyGH/tangle.

Today I upgraded from xubuntu 17.04 to 17.10 and got the following
problem:

Let Tn be the runtime when using n tasks, and T0 the runtime with no
tasks.  For all previous versions I got the expected result

    Tn = T0/n   for all 1 <= n <= min(number of cores, 100)

The limit of 100 occurs because each task uses a different
tile/orientation to start with, and there are 25 tiles and 4
orientations, so 100 starting combinations.  But this limit lies in the
distant future :-).

Now, after switching to xubuntu 17.10, I got the following strange
results:

$ for n in 0 1 2 4 8 ; do ./tangle -t $n; echo "----------------"; done
using: 0 tasks
duration 2569 ms
----------------
using: 1 tasks
duration 2571 ms
----------------
using: 2 tasks
duration 2229 ms
----------------
using: 4 tasks
duration 3101 ms
----------------
using: 8 tasks
duration 2545 ms
----------------
$

The time is almost constant, with a maximum for n = 4.  This strange
result, including the maximum at 4, is reproducible.

The value for n = 0 or 1 is almost the same as for the previous version
of xubuntu.

While the program is running, n cores are busy at 100 %, as expected.

Booting the old Linux kernel 4.10 had no effect.

My hardware: AMD FX(tm)-8350 Eight-Core Processor

Sincerely
Charly

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Poor performance after upgrade to xubuntu 17.10
  2017-10-21 10:41 Poor performance after upgrade to xubuntu 17.10 Charly
@ 2017-10-21 19:58 ` Chris M Moore
  2017-10-22 20:31   ` Charly
  2017-10-22 22:04 ` Robert Eachus
  1 sibling, 1 reply; 7+ messages in thread
From: Chris M Moore @ 2017-10-21 19:58 UTC (permalink / raw)

On 21/10/2017 11:41, Charly wrote:
> Hi,
>
> some months ago I started a thread about performance of the new
> gnat-gpl-2017 compiler and got some useful help, so I will try again.
[...]
> While the program is running, n cores are busy at 100 %, as expected.
>
> Booting the old Linux kernel 4.10 had no effect.
>
> My hardware: AMD FX(tm)-8350 Eight-Core Processor
>
> Sincerely
> Charly

I had a little look at your code.  The main part is

   declare
      Worker : Ta_Parallel.Processes
        (1 .. Ta_Types_Pkg.Proc_Id_Type (Task_Count));
   begin
      Ta_Parallel.Initialize (Verbose_Level);
      for Idx in Worker'Range loop
         Worker (Idx).Start (Idx);
      end loop;
      for Idx in Worker'Range loop
         Worker (Idx).Wait (Res);
      end loop;
   end;

The problem here is the Wait line.  After you start the tasks you then
wait for each one to complete *in order*.  So you are at the mercy of
the scheduling algorithm in how it schedules the first task.  Maybe when
N=4 it starts 1, 2, 3, 4 but completes 4, 3, 2, 1!

When you exit the block all tasks will have completed, so you don't
really need the Wait.

Chris

-- 
sig pending (since 1995)
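[Editor's note: Chris's suggestion relies on the Ada rule that a block
serving as master of its tasks does not complete until those tasks have
terminated (RM 9.3).  A minimal sketch of the change, reusing the names
from the original code and untested against it:]

```ada
declare
   Worker : Ta_Parallel.Processes
     (1 .. Ta_Types_Pkg.Proc_Id_Type (Task_Count));
begin
   Ta_Parallel.Initialize (Verbose_Level);
   for Idx in Worker'Range loop
      Worker (Idx).Start (Idx);
   end loop;
   --  No Wait loop: leaving this block awaits termination of all
   --  Worker tasks, in whatever order they happen to finish.
end;
```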
* Re: Poor performance after upgrade to xubuntu 17.10
  2017-10-21 19:58 ` Chris M Moore
@ 2017-10-22 20:31   ` Charly
  0 siblings, 0 replies; 7+ messages in thread
From: Charly @ 2017-10-22 20:31 UTC (permalink / raw)

> The problem here is the Wait line.  After you start the tasks you then
> wait for each one to complete *in order*.  So you are at the mercy of
> the scheduling algorithm in how it schedules the first task.
>
> When you exit the block all tasks will have completed, so you don't
> really need the Wait.
>
> Chris

Hi Chris,

I tried your suggestion, but it didn't have any noticeable effect.

Charly
* Re: Poor performance after upgrade to xubuntu 17.10
  2017-10-21 10:41 Poor performance after upgrade to xubuntu 17.10 Charly
  2017-10-21 19:58 ` Chris M Moore
@ 2017-10-22 22:04 ` Robert Eachus
  2017-10-23  6:11   ` Luke A. Guest
  2017-10-25 18:56   ` Charly
  1 sibling, 2 replies; 7+ messages in thread
From: Robert Eachus @ 2017-10-22 22:04 UTC (permalink / raw)

On Saturday, October 21, 2017 at 6:41:59 AM UTC-4, Charly wrote:
> My hardware: AMD FX(tm)-8350 Eight-Core Processor

Oh boy!  Welcome to the wonderful world of modern tasking.  Intel chips
with Hyperthreading and the new AMD Ryzens are different, but the issues
come out the same: sometimes not all cores can be treated equally.

The 8350 has four modules with two cores each.  Each core has its own
L1 instruction and data cache.  It shares a 2 MByte L2 cache with its
partner in the module, and there is an 8 MByte L3 cache.  I assume your
program is small enough that the compute tasks' instructions and data
fit into the L1 caches.

If you are using any floating point instructions or registers, that
opens up more potential problems.  I have some compute kernels that work
best on Bulldozer-family AMD chips and Intel chips with Hyperthreading
by using every other CPU number: 0, 2, 4, 6 in your case.  But I don't
think this code runs into that.

So far, so good.  But it looks like you are getting tripped up by one
or more data cache lines being shared between compute engines.
(Instruction cache lines?  Sharing is fine.)  It could be an actual
value shared among tasks, or several different values that get
allocated in close proximity.  I hope, and count on, task stacks not
being adjacent, so this usually happens with (shared) variables in the
parent of the tasks, or variables in the spec of generic library
packages.

If this happens, the cache management will result in just what you are
seeing.  Ownership of that cache line will act like a ring token passed
from task to task.  Parallel and Ta_Types are the two packages I'd be
suspicious of.  The detail here that may be biting you is that the
variables in these packages are on the main stack, not duplicated, if
necessary, in each task stack.

Eventually you get to the point of paranoia where you make anything
that goes on the main stack a multiple of 64 or 128 bytes and ensure
that the compiler follows your intent.  You also have the worker tasks
copy, as constants, any main program variables that are, in their view,
constants.

Finally, just good task programming.  If you expect to have each task
on its own CPU core or thread, use affinities to tie them to specific
cores.  Why?  Modern processors do not flush all caches when an
interrupt is serviced.  If you have an interrupt that doesn't, you want
the same task back on that CPU or thread.  (In fact, some CPUs go
further and have ownership tags on the cache lines, so some data in
cache can belong to the OS and the rest to your task.)

Note that when setting affinities, CPU 0 becomes affinity 1, etc.  For
each thread there is a bit vector of the threads it can run on.  On
Windows, the argument is a hex number that converts to a bit vector.
On a Hyperthreaded or Zen CPU, affinity 3 means run on either thread of
CPU 0.  In your case 3 would mean run on either of the processors in
module 0, and so on.  Setting affinity to 0 is not a good idea.

By the way, is the duplicate value 'X' in the declaration of
Ta_Types.Chip_Name intentional?  Certainly worth a comment if it is.

^ permalink raw reply	[flat|nested] 7+ messages in thread
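[Editor's note: the padding advice above can be sketched in Ada.  The
type and object names here are illustrative, not from the tangle
sources, and the 64-byte line size is an assumption about this CPU:]

```ada
--  Pad each per-task counter out to its own 64-byte cache line so a
--  write by one core does not invalidate the line the others are using
--  (false sharing).  Sizes below assume a 64-bit Long_Integer.
type Per_Task_Counter is record
   Value : Long_Integer := 0;
   Pad   : String (1 .. 56);      --  8 + 56 = 64 bytes
end record
  with Alignment => 64;           --  start each record on a line boundary

type Counter_Table is array (1 .. 8) of Per_Task_Counter;
Counters : Counter_Table;
--  Convention: task N touches only Counters (N).
```

Whether the compiler actually lays the array out one line per element
should be verified, e.g. with GNAT's -gnatR representation output.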
* Re: Poor performance after upgrade to xubuntu 17.10
  2017-10-22 22:04 ` Robert Eachus
@ 2017-10-23  6:11 ` Luke A. Guest
  2017-10-23  8:00   ` Mark Lorenzen
  1 sibling, 1 reply; 7+ messages in thread
From: Luke A. Guest @ 2017-10-23 6:11 UTC (permalink / raw)

Robert Eachus <rieachus@comcast.net> wrote:

> Finally, just good task programming.  If you expect to have each task
> on its own CPU core or thread, use affinities to tie them to specific
> cores.

How do you do this in Ada?
* Re: Poor performance after upgrade to xubuntu 17.10
  2017-10-23  6:11 ` Luke A. Guest
@ 2017-10-23  8:00 ` Mark Lorenzen
  0 siblings, 0 replies; 7+ messages in thread
From: Mark Lorenzen @ 2017-10-23 8:00 UTC (permalink / raw)

On Monday, October 23, 2017 at 8:11:25 AM UTC+2, Luke A. Guest wrote:
> Robert Eachus <rieachus@comcast.net> wrote:
> > Finally, just good task programming.  If you expect to have each
> > task on its own CPU core or thread, use affinities to tie them to
> > specific cores.
>
> How do you do this in Ada?

Aspect/pragma "CPU".

Regards,

Mark L
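[Editor's note: a minimal sketch of Mark's answer, the CPU aspect from
Ada 2012 (RM D.16); the procedure and task names are made up:]

```ada
with System.Multiprocessors;   --  CPU_Range, Number_Of_CPUs, ...

procedure Affinity_Sketch is
   use System.Multiprocessors;

   --  Statically pin this task to the first CPU.  CPUs are numbered
   --  from 1 in Ada; Not_A_Specific_CPU leaves a task unpinned.
   task Pinned_Worker with CPU => 1;

   task body Pinned_Worker is
   begin
      null;  --  compute kernel would go here
   end Pinned_Worker;
begin
   null;
end Affinity_Sketch;
```

The equivalent pragma form is `pragma CPU (1);` inside the task
definition, and System.Multiprocessors.Dispatching_Domains provides
Set_CPU for changing the affinity at run time.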
* Re: Poor performance after upgrade to xubuntu 17.10
  2017-10-22 22:04 ` Robert Eachus
  2017-10-23  6:11   ` Luke A. Guest
@ 2017-10-25 18:56   ` Charly
  1 sibling, 0 replies; 7+ messages in thread
From: Charly @ 2017-10-25 18:56 UTC (permalink / raw)

On Monday, October 23, 2017 at 00:04:19 UTC+2, Robert Eachus wrote:
> On Saturday, October 21, 2017 at 6:41:59 AM UTC-4, Charly wrote:
> > My hardware: AMD FX(tm)-8350 Eight-Core Processor
>
> Oh boy!  Welcome to the wonderful world of modern tasking.
[...]
> By the way, is the duplicate value 'X' in the declaration of
> Ta_Types.Chip_Name intentional?  Certainly worth a comment if it is.

Hi,

thank you for your elaborate answer, but I still don't see why I get
this strange behaviour after upgrading to the new version.

For the previous 8 ubuntu versions that I used on this hardware, the
runtime decreased and the total CPU usage stayed constant as I
increased the number of tasks.  But now the runtime is almost constant
and the total CPU usage increases with the number of tasks.  I assume
it's caused by the interaction of GNAT and the new glibc/libpthread
libraries.

By the way, the two 'X' in Chip_Name are not a typo.  There are 24
permutations of the colors red, green, blue and yellow, but 25 tiles,
so one permutation must occur twice.

Sincerely
Charly
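[Editor's note: the pigeonhole argument above can be checked with a few
lines of Ada; names are illustrative:]

```ada
--  4 colors give 4! = 24 distinct orderings, so 25 tiles cannot all
--  carry different permutations: at least one must appear twice.
with Ada.Text_IO;

procedure Pigeonhole_Check is
   Colors       : constant := 4;
   Tiles        : constant := 25;
   Permutations : Natural  := 1;
begin
   for K in 1 .. Colors loop
      Permutations := Permutations * K;   --  builds 4! = 24
   end loop;
   Ada.Text_IO.Put_Line
     ("permutations:" & Natural'Image (Permutations));
   pragma Assert (Tiles > Permutations);  --  hence a repeated tile
end Pigeonhole_Check;
```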
end of thread, other threads: [~2017-10-25 18:56 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-21 10:41 Poor performance after upgrade to xubuntu 17.10 Charly
2017-10-21 19:58 ` Chris M Moore
2017-10-22 20:31   ` Charly
2017-10-22 22:04 ` Robert Eachus
2017-10-23  6:11   ` Luke A. Guest
2017-10-23  8:00     ` Mark Lorenzen
2017-10-25 18:56   ` Charly