* Re: multicore-multithreading benchmarks
@ 2006-12-22  2:48 tmoran
  2006-12-22  6:14 ` tmoran
  0 siblings, 1 reply; 9+ messages in thread
From: tmoran @ 2006-12-22  2:48 UTC (permalink / raw)

Karl Nyberg was kind enough to run the N-CPU quicksort test on his Sun
"Try and Buy" evaluation T1000, with 8 cores, 4 threads per core, and got
the results below.  For the large-N cases there is a significant speedup
each time the number of tasks reaches the next power of two.  That's
logical, since the number of partitions at any given time is a power of
two (approximately, since the splits may not have been exactly even).

17:43 1646 x: ./trysortn
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.001456250   0.018297000   0.233244500   2.654417750
  2    0.001455750   0.010290500   0.129741000   1.399496750
  3    0.001457000   0.009199000   0.125779500   1.397285750
  4    0.001456750   0.007722000   0.126326000   0.787722000
  5    0.001458250   0.007488000   0.099067250   0.780770500
  6    0.001456500   0.007534000   0.094341500   0.754744250
  7    0.001457500   0.006942000   0.094361000   0.744371500
  8    0.001456750   0.007131000   0.083258000   0.693531250

17:43 1647 x: ./trysortn
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.001419750   0.018249250   0.227546000   2.614201000
  2    0.001420750   0.010317750   0.120750000   1.363506500
  3    0.001418000   0.009661250   0.103181250   1.078301000
  4    0.001421500   0.006736750   0.097727750   0.760205250
  5    0.001418250   0.006691750   0.092696000   0.728165250
  6    0.001418250   0.006635500   0.076543750   0.663240250
  7    0.001418250   0.006858750   0.077237750   0.598086250
  8    0.001422000   0.007047500   0.074811750   0.481415250

17:44 1648 x: ./trysortn
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.001445750   0.018350750   0.233402000   2.719813750
  2    0.001444250   0.010359250   0.126730750   1.437119750
  3    0.001445250   0.008780250   0.125607250   1.437211500
  4    0.001443500   0.006597250   0.125789000   1.150943750
  5    0.001445500   0.006642000   0.102507250   1.132046500
  6    0.001444000   0.006687000   0.100985000   0.937661500
  7    0.001445250   0.007033000   0.096396000   0.929958500
  8    0.001444000   0.007120500   0.091150250   0.870250250
permalink raw reply [flat|nested] 9+ messages in thread
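The power-of-two effect described above comes from the recursive splitting: each level of the recursion that still has task budget forks one subtask, so the number of concurrent partitions roughly doubles as the budget crosses each power of two. A minimal sketch of that structure in Ada follows. This is not Trysortn's actual source (which was not posted in this thread); all names and the small-slice cutoff are invented for illustration.

```ada
with Ada.Numerics.Float_Random;

procedure Psort_Sketch is

   type Vector is array (Positive range <>) of Float;

   --  One Hoare-style partitioning pass over A (First .. Last).
   --  On return, A (First .. J) <= pivot <= A (I .. Last), with I > J.
   procedure Partition (A : in out Vector; First, Last : Integer;
                        I, J : out Integer) is
      Pivot : constant Float := A ((First + Last) / 2);
      T     : Float;
   begin
      I := First;
      J := Last;
      loop
         while A (I) < Pivot loop I := I + 1; end loop;
         while Pivot < A (J) loop J := J - 1; end loop;
         if I <= J then
            T := A (I); A (I) := A (J); A (J) := T;
            I := I + 1; J := J - 1;
         end if;
         exit when I > J;
      end loop;
   end Partition;

   --  Plain sequential quicksort.
   procedure Quicksort (A : in out Vector; First, Last : Integer) is
      I, J : Integer;
   begin
      if First < Last then
         Partition (A, First, Last, I, J);
         Quicksort (A, First, J);
         Quicksort (A, I, Last);
      end if;
   end Quicksort;

   --  Sort with a budget of Tasks workers: while budget remains,
   --  partition once and hand one half to a freshly forked task.
   procedure Par_Sort (A : in out Vector; First, Last, Tasks : Integer) is
      I, J : Integer;
   begin
      if Tasks <= 1 or else Last - First < 1_000 then
         Quicksort (A, First, Last);
      else
         Partition (A, First, Last, I, J);
         declare
            task Lower_Half;
            task body Lower_Half is
            begin
               Par_Sort (A, First, J, Tasks / 2);
            end Lower_Half;
         begin
            Par_Sort (A, I, Last, Tasks - Tasks / 2);
         end;  --  block exit waits for Lower_Half to complete
      end if;
   end Par_Sort;

   Gen  : Ada.Numerics.Float_Random.Generator;
   Data : Vector (1 .. 100_000);
begin
   for K in Data'Range loop
      Data (K) := Ada.Numerics.Float_Random.Random (Gen);
   end loop;
   Par_Sort (Data, Data'First, Data'Last, Tasks => 4);
end Psort_Sketch;
```

The two recursive calls operate on disjoint slices of the same array, so no further synchronization is needed beyond the implicit wait at the end of the declare block.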
* Re: multicore-multithreading benchmarks
  2006-12-22  2:48 multicore-multithreading benchmarks tmoran
@ 2006-12-22  6:14 ` tmoran
  2006-12-22 17:12   ` Colin Paul Gloster
  0 siblings, 1 reply; 9+ messages in thread
From: tmoran @ 2006-12-22  6:14 UTC (permalink / raw)

> Karl Nyberg was kind enough to run the N-CPU quicksort test on his Sun
> "Try and Buy" evaluation T1000, with 8 cores, 4 threads per core and got

Karl did additional runs using up to 32 tasks.  Going from 1 to 4 tasks
more than tripled the speed.  Adding another 28 tasks, for a total of 32,
gave nearly another doubling of the speed (approximately 3% improvement
for each additional task).
* Re: multicore-multithreading benchmarks
  2006-12-22  6:14 ` tmoran
@ 2006-12-22 17:12 ` Colin Paul Gloster
  2006-12-23  6:50   ` tmoran
  0 siblings, 1 reply; 9+ messages in thread
From: Colin Paul Gloster @ 2006-12-22 17:12 UTC (permalink / raw)

Dear Mr. Moran,

Tom Moran posted in news:g8OdnXUIoJrz2hbYnZ2dnUVZ_v63nZ2d@comcast.com :

"Karl Nyberg was kind enough to run the N-CPU quicksort test on his Sun
"Try and Buy" evaluation T1000, with 8 cores, 4 threads per core and got
the results below.  For the large-N cases there is a significant speedup
when N becomes a larger power of two.  That's logical since the number of
partitions at any given time is a power of two (approximately, since the
split may not have been exactly even).

[..]
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
[..]
  7    0.001457500   0.006942000   0.094361000   0.744371500
  8    0.001456750   0.007131000   0.083258000   0.693531250
[..]
  7    0.001418250   0.006858750   0.077237750   0.598086250
  8    0.001422000   0.007047500   0.074811750   0.481415250
[..]
  7    0.001445250   0.007033000   0.096396000   0.929958500
  8    0.001444000   0.007120500   0.091150250   0.870250250"

We can note that with N= 10000, 8 CPUs are often slightly slower than 7
so-called "CPUs" in Nyberg's, Alexander's, and my measurements (below),
though this is not nearly as dramatic as the erratic timings of finding
the first 32 prime numbers in Wai-Mee Ching and Alex Katz, "An
Experimental APL Compiler for a Distributed Memory Parallel Machine",
Supercomputing conference 1994.

Except for my last run, TRYSORTN was the only process reported by top to
be using more than 2 %CPU.  For most of the duration of most of the runs,
TRYSORTN was reported by top to be using approximately 41%, or
sixty-something percent, or seventy-something percent, or 99.9 %CPU,
depending on the run.
I compiled TRYSORTN with gnatmake -O3 and I ran TRYSORTN on a machine
with four x86_64 cores (AMD Opterons, 1.8 GHz), which in many cases was
outperformed by the "dual AMD Opteron" which Alexander used
( news:1166492674.379186.85310@t46g2000cwa.googlegroups.com and
news:1166497708.533742.140650@48g2000cwx.googlegroups.com ).  It may be
worthwhile if Alexander provides more details (e.g. clock speed), but
these would probably still not justify the slowness of the four AMD
Opterons.  If we divide the Sun Niagara T1000's numbers in
news:g8OdnXUIoJrz2hbYnZ2dnUVZ_v63nZ2d@comcast.com by six
( news:icednfoKjsdI6hbYnZ2dnUVZ_r2onZ2d@comcast.com ), many of the speeds
are slower than the "dual AMD Opteron"'s, so the approximately $14445 for
the Sun Niagara T1000 might not be value for money.

Regards,
Colin Paul Gloster

time ./TRYSORTN
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.000085250   0.001124500   0.013804250   0.170973750
  2    0.000085250   0.000998500   0.009506500   0.092315750
  3    0.000085250   0.000915500   0.008373250   0.081367000
  4    0.000085000   0.001170250   0.009327500   0.074474500
  5    0.000083750   0.001054250   0.008961750   0.076013750
  6    0.000085250   0.001531250   0.009179000   0.075535000
  7    0.000083750   0.001367250   0.009538000   0.065402000
  8    0.000085250   0.001385250   0.007406250   0.064408750

real    0m4.674s
user    0m8.261s
sys     0m0.096s

time ./TRYSORTN
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.000081500   0.001108000   0.013810000   0.172433500
  2    0.000082750   0.000849500   0.008702750   0.092600250
  3    0.000082750   0.001033000   0.010324250   0.074759750
  4    0.000082750   0.001243000   0.008898500   0.076791750
  5    0.000082750   0.001232500   0.008550500   0.077942000
  6    0.000084750   0.001320750   0.008310250   0.077038000
  7    0.000082750   0.001440500   0.008635500   0.063810750
  8    0.000083000   0.001594750   0.008516750   0.065588750

real    0m5.062s
user    0m8.337s
sys     0m0.100s

time ./TRYSORTN
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.000083250   0.001178750   0.013625500   0.172203750
  2    0.000085000   0.001180750   0.009293750   0.092631750
  3    0.000084750   0.001093250   0.009663750   0.076857000
  4    0.000089750   0.000881500   0.008177500   0.057204750
  5    0.000087000   0.000946500   0.008502000   0.056925750
  6    0.000086500   0.001260000   0.007659000   0.057980000
  7    0.000083500   0.001646750   0.008052000   0.056552750
  8    0.000084750   0.001221000   0.007820750   0.063489000

real    0m4.242s
user    0m8.229s
sys     0m0.056s

time ./TRYSORTN
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.000089250   0.001124750   0.014132000   0.175325000
  2    0.000090750   0.001117750   0.010619000   0.093166000
  3    0.000090750   0.000858750   0.009654000   0.079514500
  4    0.000091250   0.001423750   0.008418500   0.075662750
  5    0.000090500   0.001086250   0.009355750   0.081312750
  6    0.000090750   0.001260500   0.009833500   0.077358250
  7    0.000091000   0.001032750   0.010370500   0.067873000
  8    0.000090500   0.001441750   0.008750250   0.065373250

real    0m4.640s
user    0m8.357s
sys     0m0.124s

time ./TRYSORTN
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.000082750   0.001063250   0.014282000   0.170518750
  2    0.000083750   0.000688250   0.011545500   0.093950500
  3    0.000084000   0.000972500   0.010276250   0.085763250
  4    0.000082500   0.001058500   0.009569500   0.092093500
  5    0.000083000   0.001263500   0.012566750   0.076557250
  6    0.000084000   0.001459500   0.009440250   0.074283000
  7    0.000082250   0.001378750   0.008399000   0.079315750
  8    0.000085750   0.001449000   0.010754750   0.071387000

real    0m4.843s
user    0m8.209s
sys     0m0.128s

time ./TRYSORTN
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.000090000   0.001139750   0.014450000   0.165776750
  2    0.000086250   0.001113000   0.009400250   0.087525750
  3    0.000091750   0.001160000   0.008839000   0.082454750
  4    0.000088000   0.001009500   0.009091000   0.054664750
  5    0.000086250   0.001241500   0.010316250   0.054852750
  6    0.000086250   0.001354000   0.009450250   0.053322500
  7    0.000086500   0.001272250   0.008177000   0.059826500
  8    0.000086250   0.001562500   0.009065750   0.057336250

real    0m4.287s
user    0m8.069s
sys     0m0.064s

gnatmake -v
GNATMAKE 4.0.2 20051125 (Red Hat 4.0.2-8)
Copyright 1995-2004 Free Software Foundation, Inc.

gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--enable-checking=release --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-libgcj-multifile
--enable-languages=c,c++,objc,java,f95,ada --enable-java-awt=gtk
--with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre
--host=x86_64-redhat-linux
Thread model: posix
gcc version 4.0.2 20051125 (Red Hat 4.0.2-8)

uname --all
Linux urano.iet.unipi.it 2.6.16-1.2115_FC4smp #1 SMP Mon Jun 5 15:01:20
EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
* Re: multicore-multithreading benchmarks
  2006-12-22 17:12 ` Colin Paul Gloster
@ 2006-12-23  6:50 ` tmoran
  2006-12-26 21:58   ` Chip Orange
  0 siblings, 1 reply; 9+ messages in thread
From: tmoran @ 2006-12-23  6:50 UTC (permalink / raw)

> We can note that with N= 10000, 8 CPUs are often slightly slower in
> Nyberg's; Alexander's; and my measurements (below) than 7 so-called
> "CPUs", though this is not nearly as dramatic as the erratic timings

The timings are short and thus highly subject to other things going on.
Trysortn already does four runs and returns the average but, especially
when the "noise" is bursty, the result is still very erratic.  The N=1000
case should probably be dropped since it's too fast for good timings, and
too small for a multitasking advantage.  Even N=10_000 is pretty small.
It would be interesting, though, to see the results for N=10**7, since
that ought to show some cache contention under multi-tasking.

Thanks for posting the timings on your 4-core machine.  They seem broadly
in agreement on the value of additional CPUs/cores.
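One way to blunt bursty noise in short timings, beyond the averaging Trysortn already does, is to also report the minimum over the trials: external load can only ever add time, so the minimum is closer to the undisturbed cost. A small hedged Ada sketch (not Trysortn's code; the workload and trial count are invented stand-ins):

```ada
with Ada.Calendar;  use Ada.Calendar;
with Ada.Text_IO;

procedure Time_Trials is
   Trials : constant := 4;
   Sum     : Duration := 0.0;
   Best    : Duration := Duration'Last;
   Elapsed : Duration;
   Start   : Time;

   --  Stand-in for the sort being measured.
   procedure Work_Under_Test is
      X : Long_Float := 0.0;
   begin
      for K in 1 .. 1_000_000 loop
         X := X + Long_Float (K);
      end loop;
   end Work_Under_Test;

begin
   for T in 1 .. Trials loop
      Start := Clock;
      Work_Under_Test;
      Elapsed := Clock - Start;
      Sum := Sum + Elapsed;
      if Elapsed < Best then
         Best := Elapsed;
      end if;
   end loop;
   --  The mean includes any bursty interference; the min mostly excludes it.
   Ada.Text_IO.Put_Line ("mean" & Duration'Image (Sum / Trials) &
                         "  min" & Duration'Image (Best));
end Time_Trials;
```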
* Re: multicore-multithreading benchmarks
  2006-12-23  6:50 ` tmoran
@ 2006-12-26 21:58 ` Chip Orange
  2006-12-27  8:57   ` tmoran
  0 siblings, 1 reply; 9+ messages in thread
From: Chip Orange @ 2006-12-26 21:58 UTC (permalink / raw)

<tmoran@acm.org> wrote in message
news:itydnQzMmL9XTBHYnZ2dnUVZ_r-onZ2d@comcast.com...
> > We can note that with N= 10000, 8 CPUs are often slightly slower in
> > Nyberg's; Alexander's; and my measurements (below) than 7 so-called
> > "CPUs", though this is not nearly as dramatic as the erratic timings
>
> The timings are short and thus highly subject to other things going on.
> Trysortn already does four runs and returns the average, but, especially
> when the "noise" is bursty, the result is still very erratic.  The
> N=1000 case should probably be dropped since it's too fast for good
> timings, and too small for a multitasking advantage.  Even N=10_000 is
> pretty small.  It would be interesting though to see the results for
> N=10**7 since that ought to show some cache contention under
> multi-tasking.
> Thanks for posting the timings on your 4 core machine.  They seem
> broadly in agreement on the value of additional CPUs/cores.

Hi Tom,

I'm just beginning to learn Ada, but one of the attributes that drew me to
it was the chance to learn a new paradigm of programming: that is,
dividing a programming goal into tasks that can be performed concurrently
in order to achieve much higher efficiencies.

I appreciate TrySortn very much as a learning aid, but I'm wondering how a
programmer should go about determining the number of cores dynamically in
order to make the best guess as to how many tasks to break his problem
into.  Is there a standard Ada library that would give some system
information on the number of cores, and/or perhaps some information on how
heavily loaded the system is?  (If it's already heavily loaded, you
wouldn't necessarily want to set a task going on each core.)

Thanks for any suggestions.
Chip
* Re: multicore-multithreading benchmarks
  2006-12-26 21:58 ` Chip Orange
@ 2006-12-27  8:57 ` tmoran
  2006-12-27 17:53   ` Chip Orange
  0 siblings, 1 reply; 9+ messages in thread
From: tmoran @ 2006-12-27  8:57 UTC (permalink / raw)

> wondering how a programmer should go about determining the number of
> cores dynamicly in order to make the best guess as to how many tasks to
> break his problem into?  Is there a standard Ada library that would give
> some system info on the number of cores, and/or, perhaps some info on
> how heavily loaded the system is (perhaps if it's already heavily loaded
> you wouldn't necessarily want to try to set a task going to each core?).

The Windows API call GetProcessAffinityMask tells how many CPUs/cores are
available.  In this (sorting) case, one could give a lower priority to one
of the tasks, so if multiple cores are available, you would have
multitasking, while if only one core were free, it would run the tasks
sequentially - essentially the same as if they weren't multitasked.

The programmer of a library routine would ideally like to be able to ask
"what's the current cost of CPU cycles, RAM, and cache space" and then
decide how much multitasking to do, or dynamically how to make
compute/memory tradeoffs, and he'd like to write his routines to change
behavior when the relative costs/scarcities change.  Pending that
desirable situation, I just throw up my hands, include a CPUs_To_Use
parameter in the sort routine, and let someone who knows more about the
particular application and its likely environment make that decision.
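The lower-priority-helper idea above can be sketched as follows. This is a hedged illustration with invented names, not posted Trysortn code; pragma Priority assumes a runtime that honors task priorities, and the scheduling claim in the comments is the intent, not a guarantee.

```ada
with System;

procedure Halves is
   type Vector is array (Positive range <>) of Float;

   --  Stand-in for a real sequential sort of A (First .. Last).
   procedure Sort_Slice (A : in out Vector; First, Last : Integer) is
   begin
      null;
   end Sort_Slice;

   Data : Vector (1 .. 100_000);
   Mid  : constant Integer := Data'First + Data'Length / 2;
begin
   declare
      --  Helper runs below the caller's priority: on an idle second
      --  core it proceeds in parallel; on a single busy core the
      --  caller's half tends to run first, approximating a
      --  sequential sort.
      task Helper is
         pragma Priority (System.Default_Priority - 1);
      end Helper;
      task body Helper is
      begin
         Sort_Slice (Data, Mid, Data'Last);       --  upper half
      end Helper;
   begin
      Sort_Slice (Data, Data'First, Mid - 1);     --  lower half
   end;  --  waits here until Helper completes
end Halves;
```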
* Re: multicore-multithreading benchmarks
  2006-12-27  8:57 ` tmoran
@ 2006-12-27 17:53 ` Chip Orange
  2006-12-27 18:46   ` Jeffrey Creem
  2006-12-27 19:51   ` tmoran
  0 siblings, 2 replies; 9+ messages in thread
From: Chip Orange @ 2006-12-27 17:53 UTC (permalink / raw)

<tmoran@acm.org> wrote in message
news:TOCdnecag8P5qA_YnZ2dnUVZ_qunnZ2d@comcast.com...
> > wondering how a programmer should go about determining the number of
> > cores dynamicly in order to make the best guess as to how many tasks
> > to break his problem into?  Is there a standard Ada library that would
> > give some system info on the number of cores, and/or, perhaps some
> > info on how heavily loaded the system is (perhaps if it's already
> > heavily loaded you wouldn't necessarily want to try to set a task
> > going to each core?).
>
> The Windows API call GetProcessAffinityMask tells how many CPUs/cores
> are available.  In this (sorting) case, one could give a lower priority
> to one of the tasks, so if multiple cores are available, you would have
> multitasking, while if only one core wasn't busy, it would run the tasks
> sequentially - essentially the same as if they weren't multitasked.
>
> The programmer of a library routine would ideally like to be able to ask
> "what's the current cost of CPU cycles, RAM, and cache space" and then
> decide how much multitasking to do, or dynamically how to make
> compute/memory tradeoffs, and he'd like to write his routines to change
> behavior when the relative costs/scarcities change.  Pending that
> desirable situation, I just throw up my hands, include a CPUs_To_Use
> parameter to the sort routine, and let someone who knows more about the
> particular application and its likely environment make that decision.

Thank you for the reply, and for the Windows API call information.

It sounds like you're telling me, because you gave me the Windows API,
that there's no "standard" way in Ada to handle any of this?  Does this
seem like a lack in Ada to you, as it does to me, or am I missing some
bigger picture?

With multicore CPUs being readily available, wouldn't you think any
programmer should be rewriting his applications to take advantage of
them, and so would need this type of information to do so?  Or should we,
as a general rule, just break all programming tasks into threads,
assuming that the OS will handle load balancing better than we could?

Thanks.

Chip
* Re: multicore-multithreading benchmarks
  2006-12-27 17:53 ` Chip Orange
@ 2006-12-27 18:46 ` Jeffrey Creem
  2006-12-27 19:51 ` tmoran
  1 sibling, 0 replies; 9+ messages in thread
From: Jeffrey Creem @ 2006-12-27 18:46 UTC (permalink / raw)

Chip Orange wrote:
>
> Thank you for the reply, and for the Windows API call information.
>
> It sounds like you're telling me, because you gave me the Windows API,
> that there's no "standard" way in Ada to handle any of this?  Does this
> seem like a lack in Ada to you as it does to me, or am I missing some
> bigger picture?

You are not missing anything.  There is nothing in native Ada that will
give you the information you are looking for (at least not in Ada 83/95 -
since I have not really looked at much of the new tasking features of Ada
2005, I can't say for certain that there is nothing there).

> With multicore/cpus being readily available, wouldn't you think any
> programmer should be rewriting his applications to take advantage of
> this, and so, would need this type of information to do so?  Or should
> we, as a general rule, just break all programming tasks into threads,
> assuming that the OS will handle load balancing better than we could?

I think it sort of depends on what you are doing.  When we talk about
future multicore machines, you often hear people say that nothing will
take advantage of more than 2 cores given the way today's software is
written.  That might be true in many cases.  Certainly the work I do (for
pay) will scale pretty well beyond 2 cores, but it will hit a limit not
long after.  While the programs typically have a lot of tasks (when
compared to the single core they are running on), many of them are
sporadic, low-CPU-usage tasks, so they will not find a lot of use for
more than a few CPUs.

It is not an easy problem even if the API information were available.  It
is not clear that every library call should be looking at CPUs and coming
up with a number of threads to start for every low-level operation (e.g.
sort, matrix multiply, etc.).  Having said that, building components with
built-in support for arbitrary numbers of tasks (when appropriate) does
allow the person architecting the final program the flexibility to limit
the non-portable calls to a small subset of the program and/or to simply
make reasonable guesses when putting the program together.
* Re: multicore-multithreading benchmarks
  2006-12-27 17:53 ` Chip Orange
  2006-12-27 18:46 ` Jeffrey Creem
@ 2006-12-27 19:51 ` tmoran
  1 sibling, 0 replies; 9+ messages in thread
From: tmoran @ 2006-12-27 19:51 UTC (permalink / raw)

> there's no "standard" way in Ada to handle any of this?  Does this seem
> like a lack in Ada to you as it does to me, or am I missing some bigger
> picture?

This is clearly very OS dependent.  See below for an Ada program that
gets the "affinity masks" on Windows.

> With multicore/cpus being readily available, wouldn't you think any
> programmer should be rewriting his applications to take advantage of
> this, and so, would need this type of information to do so?  Or should
> we, as a general rule, just break all programming tasks into threads,
> assuming that the OS will handle load balancing better than we could?

Well, on a dual-core machine, one of the cores can run the anti-virus
while the other does useful work. ;)  Seriously, if the OS knows more
than you do about the demands on the machine, it can, potentially, do a
better job of allocating resources.  If you know more about the
requirements than it does, then you should be able to do the better job.
Also, most of today's machines are still single CPU, and tomorrow they
may have a thousand CPUs, so it's hard to design a program that's
optimized for all situations.  (Note that Windows' "affinity mask" is
limited to a maximum of 32 CPUs.)  Judicious allocation of CPU resources
with limited fore-knowledge is part of the program designer's job.

Remember also that multi-tasking is not merely a speed-up technique, but
is also a design tool.  Some programs are more naturally written if you
use multiple tasks, even if they run on single CPUs.
with Ada.Text_IO, Interfaces.C;

procedure Showaff is
   use type Interfaces.C.Unsigned;
   type Process_Handles is new Interfaces.C.Unsigned;
   type Masks is new Interfaces.C.Unsigned;
   type Bool is new Interfaces.C.Unsigned;
   package Hex_IO is new Ada.Text_IO.Modular_IO (Masks);

   function Get_Current_Process return Process_Handles;
   pragma Import (Stdcall, Get_Current_Process, "GetCurrentProcess");

   function Get_Process_Affinity_Mask
     (Process : Process_Handles;
      Process_Mask, System_Mask : access Masks) return Bool;
   pragma Import (Stdcall, Get_Process_Affinity_Mask,
                  "GetProcessAffinityMask");

   Me : constant Process_Handles := Get_Current_Process;
   My_Mask, System_Mask : aliased Masks;
begin
   if Get_Process_Affinity_Mask
        (Me, My_Mask'Access, System_Mask'Access) = 0
   then
      Ada.Text_IO.Put_Line ("GetProcessAffinityMask failed!");
   else
      Ada.Text_IO.Put ("my mask=");
      Hex_IO.Put (My_Mask, Base => 16);
      Ada.Text_IO.Put (" system mask=");
      Hex_IO.Put (System_Mask, Base => 16);
      Ada.Text_IO.New_Line;
   end if;
end Showaff;
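To turn such an affinity mask into the CPU count Chip asked about, one can count its set bits, since each 1 bit is one processor available to the process. A hedged standalone sketch: on a real Windows system the mask would come from Get_Process_Affinity_Mask as in Showaff; here a fixed example mask stands in, and the program name is invented.

```ada
with Ada.Text_IO, Interfaces.C;

procedure Count_Cpus is
   type Masks is new Interfaces.C.Unsigned;

   --  Count the 1 bits in an affinity mask: each bit is one
   --  processor the system makes available to the process.
   function Popcount (M : Masks) return Natural is
      V : Masks   := M;
      N : Natural := 0;
   begin
      while V /= 0 loop
         N := N + Natural (V and 1);
         V := V / 2;
      end loop;
      return N;
   end Popcount;

   --  Stand-in value; a real program would use the System_Mask
   --  returned by GetProcessAffinityMask.
   Example_Mask : constant Masks := 16#0F#;  --  bits 0..3: four CPUs
begin
   Ada.Text_IO.Put_Line
     ("CPUs in mask:" & Natural'Image (Popcount (Example_Mask)));
end Count_Cpus;
```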
end of thread, other threads:[~2006-12-27 19:51 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-12-22  2:48 multicore-multithreading benchmarks tmoran
2006-12-22  6:14 ` tmoran
2006-12-22 17:12   ` Colin Paul Gloster
2006-12-23  6:50     ` tmoran
2006-12-26 21:58       ` Chip Orange
2006-12-27  8:57         ` tmoran
2006-12-27 17:53           ` Chip Orange
2006-12-27 18:46             ` Jeffrey Creem
2006-12-27 19:51             ` tmoran