* Re: multicore-multithreading benchmarks
@ 2006-12-22  2:48 tmoran
  2006-12-22  6:14 ` tmoran
  0 siblings, 1 reply; 9+ messages in thread
From: tmoran @ 2006-12-22  2:48 UTC (permalink / raw)

Karl Nyberg was kind enough to run the N-CPU quicksort test on his Sun
"Try and Buy" evaluation T1000, with 8 cores, 4 threads per core, and got
the results below.  For the large-N cases there is a significant speedup
each time the number of tasks reaches the next power of two.  That's
logical, since the number of partitions at any given time is a power of
two (approximately, since the splits may not have been exactly even).

17:43 1646 x: ./trysortn
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.001456250   0.018297000   0.233244500   2.654417750
  2    0.001455750   0.010290500   0.129741000   1.399496750
  3    0.001457000   0.009199000   0.125779500   1.397285750
  4    0.001456750   0.007722000   0.126326000   0.787722000
  5    0.001458250   0.007488000   0.099067250   0.780770500
  6    0.001456500   0.007534000   0.094341500   0.754744250
  7    0.001457500   0.006942000   0.094361000   0.744371500
  8    0.001456750   0.007131000   0.083258000   0.693531250

17:43 1647 x: ./trysortn
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.001419750   0.018249250   0.227546000   2.614201000
  2    0.001420750   0.010317750   0.120750000   1.363506500
  3    0.001418000   0.009661250   0.103181250   1.078301000
  4    0.001421500   0.006736750   0.097727750   0.760205250
  5    0.001418250   0.006691750   0.092696000   0.728165250
  6    0.001418250   0.006635500   0.076543750   0.663240250
  7    0.001418250   0.006858750   0.077237750   0.598086250
  8    0.001422000   0.007047500   0.074811750   0.481415250

17:44 1648 x: ./trysortn
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.001445750   0.018350750   0.233402000   2.719813750
  2    0.001444250   0.010359250   0.126730750   1.437119750
  3    0.001445250   0.008780250   0.125607250   1.437211500
  4    0.001443500   0.006597250   0.125789000   1.150943750
  5    0.001445500   0.006642000   0.102507250   1.132046500
  6    0.001444000   0.006687000   0.100985000   0.937661500
  7    0.001445250   0.007033000   0.096396000   0.929958500
  8    0.001444000   0.007120500   0.091150250   0.870250250
permalink raw reply [flat|nested] 9+ messages in thread
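The power-of-two effect described above comes from the recursive splitting: each level of the recursion that still has task budget forks one subtask, so the number of concurrent partitions roughly doubles as the budget crosses each power of two. A minimal sketch of that structure in Ada follows. This is not Trysortn's actual source (which was not posted in this thread); all names and the small-slice cutoff are invented for illustration.

```ada
with Ada.Numerics.Float_Random;

procedure Psort_Sketch is

   type Vector is array (Positive range <>) of Float;

   --  One Hoare-style partitioning pass over A (First .. Last).
   --  On return, A (First .. J) <= pivot <= A (I .. Last), with I > J.
   procedure Partition (A : in out Vector; First, Last : Integer;
                        I, J : out Integer) is
      Pivot : constant Float := A ((First + Last) / 2);
      T     : Float;
   begin
      I := First;
      J := Last;
      loop
         while A (I) < Pivot loop I := I + 1; end loop;
         while Pivot < A (J) loop J := J - 1; end loop;
         if I <= J then
            T := A (I); A (I) := A (J); A (J) := T;
            I := I + 1; J := J - 1;
         end if;
         exit when I > J;
      end loop;
   end Partition;

   --  Plain sequential quicksort.
   procedure Quicksort (A : in out Vector; First, Last : Integer) is
      I, J : Integer;
   begin
      if First < Last then
         Partition (A, First, Last, I, J);
         Quicksort (A, First, J);
         Quicksort (A, I, Last);
      end if;
   end Quicksort;

   --  Sort with a budget of Tasks workers: while budget remains,
   --  partition once and hand one half to a freshly forked task.
   procedure Par_Sort (A : in out Vector; First, Last, Tasks : Integer) is
      I, J : Integer;
   begin
      if Tasks <= 1 or else Last - First < 1_000 then
         Quicksort (A, First, Last);
      else
         Partition (A, First, Last, I, J);
         declare
            task Lower_Half;
            task body Lower_Half is
            begin
               Par_Sort (A, First, J, Tasks / 2);
            end Lower_Half;
         begin
            Par_Sort (A, I, Last, Tasks - Tasks / 2);
         end;  --  block exit waits for Lower_Half to complete
      end if;
   end Par_Sort;

   Gen  : Ada.Numerics.Float_Random.Generator;
   Data : Vector (1 .. 100_000);
begin
   for K in Data'Range loop
      Data (K) := Ada.Numerics.Float_Random.Random (Gen);
   end loop;
   Par_Sort (Data, Data'First, Data'Last, Tasks => 4);
end Psort_Sketch;
```

The two recursive calls operate on disjoint slices of the same array, so no further synchronization is needed beyond the implicit wait at the end of the declare block.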
* Re: multicore-multithreading benchmarks
  2006-12-22  2:48 multicore-multithreading benchmarks tmoran
@ 2006-12-22  6:14 ` tmoran
  2006-12-22 17:12   ` Colin Paul Gloster
  0 siblings, 1 reply; 9+ messages in thread
From: tmoran @ 2006-12-22  6:14 UTC (permalink / raw)

> Karl Nyberg was kind enough to run the N-CPU quicksort test on his Sun
> "Try and Buy" evaluation T1000, with 8 cores, 4 threads per core and got

Karl did additional runs using up to 32 tasks.  Going from 1 to 4 tasks
more than tripled the speed.  Adding another 28 tasks, for a total of 32,
gave nearly another doubling of the speed (approximately 3% improvement
for each additional task).
* Re: multicore-multithreading benchmarks
  2006-12-22  6:14 ` tmoran
@ 2006-12-22 17:12 ` Colin Paul Gloster
  2006-12-23  6:50   ` tmoran
  0 siblings, 1 reply; 9+ messages in thread
From: Colin Paul Gloster @ 2006-12-22 17:12 UTC (permalink / raw)

Dear Mr. Moran,

Tom Moran posted in news:g8OdnXUIoJrz2hbYnZ2dnUVZ_v63nZ2d@comcast.com :

"Karl Nyberg was kind enough to run the N-CPU quicksort test on his Sun
"Try and Buy" evaluation T1000, with 8 cores, 4 threads per core and got
the results below.  For the large-N cases there is a significant speedup
when N becomes a larger power of two.  That's logical since the number of
partitions at any given time is a power of two (approximately, since the
split may not have been exactly even).

[..]
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
[..]
  7    0.001457500   0.006942000   0.094361000   0.744371500
  8    0.001456750   0.007131000   0.083258000   0.693531250
[..]
  7    0.001418250   0.006858750   0.077237750   0.598086250
  8    0.001422000   0.007047500   0.074811750   0.481415250
[..]
  7    0.001445250   0.007033000   0.096396000   0.929958500
  8    0.001444000   0.007120500   0.091150250   0.870250250"

We can note that with N= 10000, 8 CPUs are often slightly slower than 7
so-called "CPUs" in Nyberg's, Alexander's, and my measurements (below),
though this is not nearly as dramatic as the erratic timings of finding
the first 32 prime numbers in Wai-Mee Ching and Alex Katz, "An
Experimental APL Compiler for a Distributed Memory Parallel Machine",
Supercomputing conference 1994.

Except for my last run, TRYSORTN was the only process reported by top to
be using more than 2 %CPU.  For most of the duration of most of the runs,
TRYSORTN was reported by top to be using approximately 41%, or
sixty-something percent, or seventy-something percent, or 99.9 %CPU,
depending on the run.
I compiled TRYSORTN with gnatmake -O3 and I ran TRYSORTN on a machine
with four x86_64 cores (AMD Opterons, 1.8 GHz), which in many cases was
outperformed by the "dual AMD Opteron" which Alexander used
( news:1166492674.379186.85310@t46g2000cwa.googlegroups.com and
news:1166497708.533742.140650@48g2000cwx.googlegroups.com ).  It may be
worthwhile if Alexander provides more details (e.g. clock speed), but
these would probably still not justify the slowness of the four AMD
Opterons.  If we divide the Sun Niagara T1000's numbers in
news:g8OdnXUIoJrz2hbYnZ2dnUVZ_v63nZ2d@comcast.com by six
( news:icednfoKjsdI6hbYnZ2dnUVZ_r2onZ2d@comcast.com ), many of the speeds
are slower than the "dual AMD Opteron"'s, so the approximately $14445 for
the Sun Niagara T1000 might not be value for money.

Regards,
Colin Paul Gloster

time ./TRYSORTN
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.000085250   0.001124500   0.013804250   0.170973750
  2    0.000085250   0.000998500   0.009506500   0.092315750
  3    0.000085250   0.000915500   0.008373250   0.081367000
  4    0.000085000   0.001170250   0.009327500   0.074474500
  5    0.000083750   0.001054250   0.008961750   0.076013750
  6    0.000085250   0.001531250   0.009179000   0.075535000
  7    0.000083750   0.001367250   0.009538000   0.065402000
  8    0.000085250   0.001385250   0.007406250   0.064408750

real    0m4.674s
user    0m8.261s
sys     0m0.096s

time ./TRYSORTN
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.000081500   0.001108000   0.013810000   0.172433500
  2    0.000082750   0.000849500   0.008702750   0.092600250
  3    0.000082750   0.001033000   0.010324250   0.074759750
  4    0.000082750   0.001243000   0.008898500   0.076791750
  5    0.000082750   0.001232500   0.008550500   0.077942000
  6    0.000084750   0.001320750   0.008310250   0.077038000
  7    0.000082750   0.001440500   0.008635500   0.063810750
  8    0.000083000   0.001594750   0.008516750   0.065588750

real    0m5.062s
user    0m8.337s
sys     0m0.100s

time ./TRYSORTN
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.000083250   0.001178750   0.013625500   0.172203750
  2    0.000085000   0.001180750   0.009293750   0.092631750
  3    0.000084750   0.001093250   0.009663750   0.076857000
  4    0.000089750   0.000881500   0.008177500   0.057204750
  5    0.000087000   0.000946500   0.008502000   0.056925750
  6    0.000086500   0.001260000   0.007659000   0.057980000
  7    0.000083500   0.001646750   0.008052000   0.056552750
  8    0.000084750   0.001221000   0.007820750   0.063489000

real    0m4.242s
user    0m8.229s
sys     0m0.056s

time ./TRYSORTN
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.000089250   0.001124750   0.014132000   0.175325000
  2    0.000090750   0.001117750   0.010619000   0.093166000
  3    0.000090750   0.000858750   0.009654000   0.079514500
  4    0.000091250   0.001423750   0.008418500   0.075662750
  5    0.000090500   0.001086250   0.009355750   0.081312750
  6    0.000090750   0.001260500   0.009833500   0.077358250
  7    0.000091000   0.001032750   0.010370500   0.067873000
  8    0.000090500   0.001441750   0.008750250   0.065373250

real    0m4.640s
user    0m8.357s
sys     0m0.124s

time ./TRYSORTN
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.000082750   0.001063250   0.014282000   0.170518750
  2    0.000083750   0.000688250   0.011545500   0.093950500
  3    0.000084000   0.000972500   0.010276250   0.085763250
  4    0.000082500   0.001058500   0.009569500   0.092093500
  5    0.000083000   0.001263500   0.012566750   0.076557250
  6    0.000084000   0.001459500   0.009440250   0.074283000
  7    0.000082250   0.001378750   0.008399000   0.079315750
  8    0.000085750   0.001449000   0.010754750   0.071387000

real    0m4.843s
user    0m8.209s
sys     0m0.128s

time ./TRYSORTN
CPUs   N= 1000       N= 10000      N= 100000     N= 1000000
  1    0.000090000   0.001139750   0.014450000   0.165776750
  2    0.000086250   0.001113000   0.009400250   0.087525750
  3    0.000091750   0.001160000   0.008839000   0.082454750
  4    0.000088000   0.001009500   0.009091000   0.054664750
  5    0.000086250   0.001241500   0.010316250   0.054852750
  6    0.000086250   0.001354000   0.009450250   0.053322500
  7    0.000086500   0.001272250   0.008177000   0.059826500
  8    0.000086250   0.001562500   0.009065750   0.057336250

real    0m4.287s
user    0m8.069s
sys     0m0.064s

gnatmake -v
GNATMAKE 4.0.2 20051125 (Red Hat 4.0.2-8)
Copyright 1995-2004 Free Software Foundation, Inc.

gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--enable-checking=release --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-libgcj-multifile
--enable-languages=c,c++,objc,java,f95,ada --enable-java-awt=gtk
--with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre
--host=x86_64-redhat-linux
Thread model: posix
gcc version 4.0.2 20051125 (Red Hat 4.0.2-8)

uname --all
Linux urano.iet.unipi.it 2.6.16-1.2115_FC4smp #1 SMP Mon Jun 5 15:01:20
EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
* Re: multicore-multithreading benchmarks
  2006-12-22 17:12 ` Colin Paul Gloster
@ 2006-12-23  6:50 ` tmoran
  2006-12-26 21:58   ` Chip Orange
  0 siblings, 1 reply; 9+ messages in thread
From: tmoran @ 2006-12-23  6:50 UTC (permalink / raw)

> We can note that with N= 10000, 8 CPUs are often slightly slower in
> Nyberg's; Alexander's; and my measurements (below) than 7 so-called
> "CPUs", though this is not nearly as dramatic as the erratic timings

The timings are short and thus highly subject to other things going on.
Trysortn already does four runs and returns the average but, especially
when the "noise" is bursty, the result is still very erratic.  The N=1000
case should probably be dropped since it's too fast for good timings, and
too small for a multitasking advantage.  Even N=10_000 is pretty small.
It would be interesting, though, to see the results for N=10**7, since
that ought to show some cache contention under multi-tasking.

Thanks for posting the timings on your 4-core machine.  They seem broadly
in agreement on the value of additional CPUs/cores.
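One way to blunt bursty noise in short timings, beyond the averaging Trysortn already does, is to also report the minimum over the trials: external load can only ever add time, so the minimum is closer to the undisturbed cost. A small hedged Ada sketch (not Trysortn's code; the workload and trial count are invented stand-ins):

```ada
with Ada.Calendar;  use Ada.Calendar;
with Ada.Text_IO;

procedure Time_Trials is
   Trials : constant := 4;
   Sum     : Duration := 0.0;
   Best    : Duration := Duration'Last;
   Elapsed : Duration;
   Start   : Time;

   --  Stand-in for the sort being measured.
   procedure Work_Under_Test is
      X : Long_Float := 0.0;
   begin
      for K in 1 .. 1_000_000 loop
         X := X + Long_Float (K);
      end loop;
   end Work_Under_Test;

begin
   for T in 1 .. Trials loop
      Start := Clock;
      Work_Under_Test;
      Elapsed := Clock - Start;
      Sum := Sum + Elapsed;
      if Elapsed < Best then
         Best := Elapsed;
      end if;
   end loop;
   --  The mean includes any bursty interference; the min mostly excludes it.
   Ada.Text_IO.Put_Line ("mean" & Duration'Image (Sum / Trials) &
                         "  min" & Duration'Image (Best));
end Time_Trials;
```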
* Re: multicore-multithreading benchmarks
  2006-12-23  6:50 ` tmoran
@ 2006-12-26 21:58 ` Chip Orange
  2006-12-27  8:57   ` tmoran
  0 siblings, 1 reply; 9+ messages in thread
From: Chip Orange @ 2006-12-26 21:58 UTC (permalink / raw)

<tmoran@acm.org> wrote in message
news:itydnQzMmL9XTBHYnZ2dnUVZ_r-onZ2d@comcast.com...
> > We can note that with N= 10000, 8 CPUs are often slightly slower in
> > Nyberg's; Alexander's; and my measurements (below) than 7 so-called
> > "CPUs", though this is not nearly as dramatic as the erratic timings
>
> The timings are short and thus highly subject to other things going on.
> Trysortn already does four runs and returns the average, but, especially
> when the "noise" is bursty, the result is still very erratic.  The
> N=1000 case should probably be dropped since it's too fast for good
> timings, and too small for a multitasking advantage.  Even N=10_000 is
> pretty small.  It would be interesting though to see the results for
> N=10**7 since that ought to show some cache contention under
> multi-tasking.
> Thanks for posting the timings on your 4 core machine.  They seem
> broadly in agreement on the value of additional CPUs/cores.

Hi Tom,

I'm just beginning to learn Ada, but one of the attributes that drew me to
it was the chance to learn a new paradigm of programming: that is,
dividing a programming goal into tasks that can be performed concurrently
in order to achieve much higher efficiencies.

I appreciate TrySortn very much as a learning aid, but I'm wondering how a
programmer should go about determining the number of cores dynamically in
order to make the best guess as to how many tasks to break his problem
into.  Is there a standard Ada library that would give some system
information on the number of cores, and/or perhaps some information on how
heavily loaded the system is?  (If it's already heavily loaded, you
wouldn't necessarily want to set a task going on each core.)

Thanks for any suggestions.
Chip
* Re: multicore-multithreading benchmarks
  2006-12-26 21:58 ` Chip Orange
@ 2006-12-27  8:57 ` tmoran
  2006-12-27 17:53   ` Chip Orange
  0 siblings, 1 reply; 9+ messages in thread
From: tmoran @ 2006-12-27  8:57 UTC (permalink / raw)

> wondering how a programmer should go about determining the number of
> cores dynamicly in order to make the best guess as to how many tasks to
> break his problem into?  Is there a standard Ada library that would give
> some system info on the number of cores, and/or, perhaps some info on
> how heavily loaded the system is (perhaps if it's already heavily loaded
> you wouldn't necessarily want to try to set a task going to each core?).

The Windows API call GetProcessAffinityMask tells how many CPUs/cores are
available.  In this (sorting) case, one could give a lower priority to one
of the tasks, so if multiple cores are available, you would have
multitasking, while if only one core were free, it would run the tasks
sequentially - essentially the same as if they weren't multitasked.

The programmer of a library routine would ideally like to be able to ask
"what's the current cost of CPU cycles, RAM, and cache space" and then
decide how much multitasking to do, or dynamically how to make
compute/memory tradeoffs, and he'd like to write his routines to change
behavior when the relative costs/scarcities change.  Pending that
desirable situation, I just throw up my hands, include a CPUs_To_Use
parameter in the sort routine, and let someone who knows more about the
particular application and its likely environment make that decision.
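The lower-priority-helper idea above can be sketched as follows. This is a hedged illustration with invented names, not posted Trysortn code; pragma Priority assumes a runtime that honors task priorities, and the scheduling claim in the comments is the intent, not a guarantee.

```ada
with System;

procedure Halves is
   type Vector is array (Positive range <>) of Float;

   --  Stand-in for a real sequential sort of A (First .. Last).
   procedure Sort_Slice (A : in out Vector; First, Last : Integer) is
   begin
      null;
   end Sort_Slice;

   Data : Vector (1 .. 100_000);
   Mid  : constant Integer := Data'First + Data'Length / 2;
begin
   declare
      --  Helper runs below the caller's priority: on an idle second
      --  core it proceeds in parallel; on a single busy core the
      --  caller's half tends to run first, approximating a
      --  sequential sort.
      task Helper is
         pragma Priority (System.Default_Priority - 1);
      end Helper;
      task body Helper is
      begin
         Sort_Slice (Data, Mid, Data'Last);       --  upper half
      end Helper;
   begin
      Sort_Slice (Data, Data'First, Mid - 1);     --  lower half
   end;  --  waits here until Helper completes
end Halves;
```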
* Re: multicore-multithreading benchmarks
  2006-12-27  8:57 ` tmoran
@ 2006-12-27 17:53 ` Chip Orange
  2006-12-27 18:46   ` Jeffrey Creem
  2006-12-27 19:51   ` tmoran
  0 siblings, 2 replies; 9+ messages in thread
From: Chip Orange @ 2006-12-27 17:53 UTC (permalink / raw)

<tmoran@acm.org> wrote in message
news:TOCdnecag8P5qA_YnZ2dnUVZ_qunnZ2d@comcast.com...
> > wondering how a programmer should go about determining the number of
> > cores dynamicly in order to make the best guess as to how many tasks
> > to break his problem into?  Is there a standard Ada library that would
> > give some system info on the number of cores, and/or, perhaps some
> > info on how heavily loaded the system is (perhaps if it's already
> > heavily loaded you wouldn't necessarily want to try to set a task
> > going to each core?).
>
> The Windows API call GetProcessAffinityMask tells how many CPUs/cores
> are available.  In this (sorting) case, one could give a lower priority
> to one of the tasks, so if multiple cores are available, you would have
> multitasking, while if only one core wasn't busy, it would run the tasks
> sequentially - essentially the same as if they weren't multitasked.
>
> The programmer of a library routine would ideally like to be able to ask
> "what's the current cost of CPU cycles, RAM, and cache space" and then
> decide how much multitasking to do, or dynamically how to make
> compute/memory tradeoffs, and he'd like to write his routines to change
> behavior when the relative costs/scarcities change.  Pending that
> desirable situation, I just throw up my hands, include a CPUs_To_Use
> parameter to the sort routine, and let someone who knows more about the
> particular application and its likely environment make that decision.

Thank you for the reply, and for the Windows API call information.

It sounds like you're telling me, because you gave me the Windows API,
that there's no "standard" way in Ada to handle any of this?  Does this
seem like a lack in Ada to you, as it does to me, or am I missing some
bigger picture?

With multicore CPUs being readily available, wouldn't you think any
programmer should be rewriting his applications to take advantage of
them, and so would need this type of information to do so?  Or should we,
as a general rule, just break all programming tasks into threads,
assuming that the OS will handle load balancing better than we could?

Thanks.

Chip
* Re: multicore-multithreading benchmarks
  2006-12-27 17:53 ` Chip Orange
@ 2006-12-27 18:46 ` Jeffrey Creem
  2006-12-27 19:51 ` tmoran
  1 sibling, 0 replies; 9+ messages in thread
From: Jeffrey Creem @ 2006-12-27 18:46 UTC (permalink / raw)

Chip Orange wrote:
>
> Thank you for the reply, and for the Windows API call information.
>
> It sounds like you're telling me, because you gave me the Windows API,
> that there's no "standard" way in Ada to handle any of this?  Does this
> seem like a lack in Ada to you as it does to me, or am I missing some
> bigger picture?

You are not missing anything.  There is nothing in native Ada that will
give you the information you are looking for (at least not in Ada 83/95 -
since I have not really looked at much of the new tasking features of Ada
2005, I can't say for certain that there is nothing there).

> With multicore/cpus being readily available, wouldn't you think any
> programmer should be rewriting his applications to take advantage of
> this, and so, would need this type of information to do so?  Or should
> we, as a general rule, just break all programming tasks into threads,
> assuming that the OS will handle load balancing better than we could?

I think it sort of depends on what you are doing.  When we talk about
future multicore machines, you often hear people say that nothing will
take advantage of more than 2 cores given the way today's software is
written.  That might be true in many cases.  Certainly the work I do (for
pay) will scale pretty well beyond 2 cores, but it will hit a limit not
long after.  While the programs typically have a lot of tasks (when
compared to the single core they are running on), many of them are
sporadic, low-CPU-usage tasks, so they will not find a lot of use for
more than a few CPUs.

It is not an easy problem even if the API information were available.  It
is not clear that every library call should be looking at CPUs and coming
up with a number of threads to start for every low-level operation (e.g.
sort, matrix multiply, etc.).  Having said that, building components with
built-in support for arbitrary numbers of tasks (when appropriate) does
allow the person architecting the final program the flexibility to limit
the non-portable calls to a small subset of the program and/or to simply
make reasonable guesses when putting the program together.
* Re: multicore-multithreading benchmarks
  2006-12-27 17:53 ` Chip Orange
  2006-12-27 18:46 ` Jeffrey Creem
@ 2006-12-27 19:51 ` tmoran
  1 sibling, 0 replies; 9+ messages in thread
From: tmoran @ 2006-12-27 19:51 UTC (permalink / raw)

> there's no "standard" way in Ada to handle any of this?  Does this seem
> like a lack in Ada to you as it does to me, or am I missing some bigger
> picture?

This is clearly very OS dependent.  See below for an Ada program that
gets the "affinity masks" on Windows.

> With multicore/cpus being readily available, wouldn't you think any
> programmer should be rewriting his applications to take advantage of
> this, and so, would need this type of information to do so?  Or should
> we, as a general rule, just break all programming tasks into threads,
> assuming that the OS will handle load balancing better than we could?

Well, on a dual-core machine, one of the cores can run the anti-virus
while the other does useful work. ;)  Seriously, if the OS knows more
than you do about the demands on the machine, it can, potentially, do a
better job of allocating resources.  If you know more about the
requirements than it does, then you should be able to do the better job.
Also, most of today's machines are still single CPU, and tomorrow they
may have a thousand CPUs, so it's hard to design a program that's
optimized for all situations.  (Note that Windows' "affinity mask" is
limited to a maximum of 32 CPUs.)  Judicious allocation of CPU resources
with limited fore-knowledge is part of the program designer's job.

Remember also that multi-tasking is not merely a speed-up technique, but
is also a design tool.  Some programs are more naturally written if you
use multiple tasks, even if they run on single CPUs.
with Ada.Text_IO, Interfaces.C;

procedure Showaff is
   use type Interfaces.C.Unsigned;
   type Process_Handles is new Interfaces.C.Unsigned;
   type Masks is new Interfaces.C.Unsigned;
   type Bool is new Interfaces.C.Unsigned;
   package Hex_IO is new Ada.Text_IO.Modular_IO (Masks);

   function Get_Current_Process return Process_Handles;
   pragma Import (Stdcall, Get_Current_Process, "GetCurrentProcess");

   function Get_Process_Affinity_Mask
     (Process : Process_Handles;
      Process_Mask, System_Mask : access Masks) return Bool;
   pragma Import (Stdcall, Get_Process_Affinity_Mask,
                  "GetProcessAffinityMask");

   Me : constant Process_Handles := Get_Current_Process;
   My_Mask, System_Mask : aliased Masks;
begin
   if Get_Process_Affinity_Mask
        (Me, My_Mask'Access, System_Mask'Access) = 0
   then
      Ada.Text_IO.Put_Line ("GetProcessAffinityMask failed!");
   else
      Ada.Text_IO.Put ("my mask=");
      Hex_IO.Put (My_Mask, Base => 16);
      Ada.Text_IO.Put (" system mask=");
      Hex_IO.Put (System_Mask, Base => 16);
      Ada.Text_IO.New_Line;
   end if;
end Showaff;
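To turn such an affinity mask into the CPU count Chip asked about, one can count its set bits, since each 1 bit is one processor available to the process. A hedged standalone sketch: on a real Windows system the mask would come from Get_Process_Affinity_Mask as in Showaff; here a fixed example mask stands in, and the program name is invented.

```ada
with Ada.Text_IO, Interfaces.C;

procedure Count_Cpus is
   type Masks is new Interfaces.C.Unsigned;

   --  Count the 1 bits in an affinity mask: each bit is one
   --  processor the system makes available to the process.
   function Popcount (M : Masks) return Natural is
      V : Masks   := M;
      N : Natural := 0;
   begin
      while V /= 0 loop
         N := N + Natural (V and 1);
         V := V / 2;
      end loop;
      return N;
   end Popcount;

   --  Stand-in value; a real program would use the System_Mask
   --  returned by GetProcessAffinityMask.
   Example_Mask : constant Masks := 16#0F#;  --  bits 0..3: four CPUs
begin
   Ada.Text_IO.Put_Line
     ("CPUs in mask:" & Natural'Image (Popcount (Example_Mask)));
end Count_Cpus;
```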
end of thread, other threads:[~2006-12-27 19:51 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-12-22  2:48 multicore-multithreading benchmarks tmoran
2006-12-22  6:14 ` tmoran
2006-12-22 17:12   ` Colin Paul Gloster
2006-12-23  6:50     ` tmoran
2006-12-26 21:58       ` Chip Orange
2006-12-27  8:57         ` tmoran
2006-12-27 17:53           ` Chip Orange
2006-12-27 18:46             ` Jeffrey Creem
2006-12-27 19:51             ` tmoran