comp.lang.ada
 help / color / mirror / Atom feed
* puzzled re hyperthreaded performance
@ 2005-09-15  5:10 tmoran
  2005-09-15  6:08 ` jtg
  2005-09-19 20:21 ` Wiljan Derks
  0 siblings, 2 replies; 7+ messages in thread
From: tmoran @ 2005-09-15  5:10 UTC (permalink / raw)


  I have a multitasking program that runs faster if I insert a "delay d;"
statement to prevent some of the multitasking overlap.
  This is a version of the Shootout k-nucleotide program, which involves
a series of create/fill/lookup in hash tables, using the Ada 95 version
of Ada.Containers.Indefinite_Hashed_Maps  I let the operations run in
two tasks, each with their own hash table.  Memory requirement are minor.
    Alt_Task.Start;
    delay d;
    Compute;
    Alt_Task.Finish;
Alt_Task performs the same computations as Compute, reading the same
String, but sharing no other data.  There are no obvious task interactions,
and I don't see any in Ada.Containers
  For every additional 1.0 seconds of "d", the sequence runs about 1.5
seconds faster.  This is Gnat 3.15p on W2k on a hyperthreaded Pentium
(two virtual CPUs).
  Any suggestions what's going on?



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: puzzled re hyperthreaded performance
  2005-09-15  5:10 puzzled re hyperthreaded performance tmoran
@ 2005-09-15  6:08 ` jtg
  2005-09-15 12:13   ` keld_nielsen_4nulspam
  2005-09-19 20:21 ` Wiljan Derks
  1 sibling, 1 reply; 7+ messages in thread
From: jtg @ 2005-09-15  6:08 UTC (permalink / raw)


tmoran@acm.org wrote:
>   I have a multitasking program that runs faster if I insert a "delay d;"
> statement to prevent some of the multitasking overlap.
>   This is a version of the Shootout k-nucleotide program, which involves
> a series of create/fill/lookup in hash tables, using the Ada 95 version
> of Ada.Containers.Indefinite_Hashed_Maps  I let the operations run in
> two tasks, each with their own hash table.  Memory requirement are minor.
>     Alt_Task.Start;
>     delay d;
>     Compute;
>     Alt_Task.Finish;
> Alt_Task performs the same computations as Compute, reading the same
> String, but sharing no other data.  There are no obvious task interactions,
> and I don't see any in Ada.Containers
>   For every additional 1.0 seconds of "d", the sequence runs about 1.5
> seconds faster.  This is Gnat 3.15p on W2k on a hyperthreaded Pentium
> (two virtual CPUs).
>   Any suggestions what's going on?

For every second of parallel processing 1.5 second is lost.
So parallel processing slows the computations 2.5 times!
There may be several reasons, more or less obvious.
I guess the most important may be inefficient cache usage.
You may try to perform both operations in sequence, or make them
share the same hash table (if they perform the same computations on
the same data), or split the computations in other way (one
task with hash table and one for calculations only).
Other factor may be task switching, but I can't believe it can
slow the processing so hard!

By the way, when you have "proper" multiprocessor environment, where each
processor has its own cache, in certain circumstances you may observe the
opposite phenomenon: for example problem solving on two processors can run
MORE than two times faster.

I suggest you run a test in some "proper" multiprocessor environment,
i.e. two processors or a true multicore processor (with separate caches),
but not virtual MP.

One more thing: ensure Alt_Task is not terminated on Finish, but
the computations are completed.



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: puzzled re hyperthreaded performance
  2005-09-15  6:08 ` jtg
@ 2005-09-15 12:13   ` keld_nielsen_4nulspam
  2005-09-15 20:44     ` tmoran
  0 siblings, 1 reply; 7+ messages in thread
From: keld_nielsen_4nulspam @ 2005-09-15 12:13 UTC (permalink / raw)


jtg ha scritto:

> There may be several reasons, more or less obvious.
> I guess the most important may be inefficient cache usage.
>

We have done some testing on the topic too, and our view is that speed
differences are due to "memory access" in hyperthreading mode compared
to conventional cpu's.

Her is a brief version of what we did. We were working on linear
algebra systems, i.e. the classical A*x=b matrix system, and for
testing it was solved with both Jacobi and Gauss_Seidel. Jacobi is
banal to parallelize, and a multithread program in Ada was tested => No
apparent difference in speed.

My friend Orazio then wrote a Jacobi version (in f77 - sorry for that
;-) with "manual loop unrolling" or two statements for i and i+1 inside
the loop => Speed was doubled!

In the end, to check for classical parallel race/timing conditions, the
Gauss-Seidel was implemented. => Again double speed and no race/timing
conditions observed.

Our conclusion is, that the single cpu works just as one, but the "bus
is boosted" with a factor two.

Now coming back to the threading and speed improvements above, in my
view it is a matter of which trick to use for getting the bus to work
in a boosted mode. We obtained it with simple loop unrolling, you
report threading. Are there other experiences out there?




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: puzzled re hyperthreaded performance
  2005-09-15 12:13   ` keld_nielsen_4nulspam
@ 2005-09-15 20:44     ` tmoran
  2005-09-15 21:43       ` tmoran
  0 siblings, 1 reply; 7+ messages in thread
From: tmoran @ 2005-09-15 20:44 UTC (permalink / raw)


> Now coming back to the threading and speed improvements above, in my
> view it is a matter of which trick to use for getting the bus to work
> in a boosted mode. We obtained it with simple loop unrolling, you
> report threading. Are there other experiences out there?
    Loop unrolling sounds like the machine was able to use multiple
arithmetic units to parallelize work at the sub-instruction level.  If
that happens then perhaps there are no extra resources for the second
virtual CPU, so there would be no performance speedup from
multi-threading, but I'm certainly not expert in the internals of Intel's
hyperthreading etc.  I wouldn't expect an significant slowdown however.
    Cache thrashing certainly comes to mind as a possible problem with
multi-threading.  But the program at hand is small.  I cut the problem
size so there are at most two simultaneous hash tables of Length 4091 for
6 character Strings, which, assuming 20 bytes/entry would be 2*4000*20 or
160K, plus 50K input String = 210K which I would think would fit well
inside the cache without thrashing.  But it's still slower (about 10%) to
run multithreaded than sequential.



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: puzzled re hyperthreaded performance
  2005-09-15 20:44     ` tmoran
@ 2005-09-15 21:43       ` tmoran
  0 siblings, 0 replies; 7+ messages in thread
From: tmoran @ 2005-09-15 21:43 UTC (permalink / raw)


Further data:

>     Cache thrashing certainly comes to mind as a possible problem with
> multi-threading.  But the program at hand is small.  I cut the problem
> size so there are at most two simultaneous hash tables of Length 4091 for
> 6 character Strings, which, assuming 20 bytes/entry would be 2*4000*20 or
> 160K, plus 50K input String = 210K which I would think would fit well
> inside the cache without thrashing.  But it's still slower (about 10%) to
> run multithreaded than sequential.
   Since this used exclusively 6 character Strings, I changed from
Ada.Containers.Indefinite_Hashed_Maps to
Ada.Containers.Hashed_Maps
Now the parallel version runs 30% faster than the sequential one.
Increasing the string (and thus hash table) size to more probably
thrash the cache brings it back to parallel being slower than sequential.
So part of the problem is cache thrashing, but there's still some
remaining problem when using Indefinite_Hashed_Maps.



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: puzzled re hyperthreaded performance
  2005-09-15  5:10 puzzled re hyperthreaded performance tmoran
  2005-09-15  6:08 ` jtg
@ 2005-09-19 20:21 ` Wiljan Derks
  2005-09-20  0:26   ` tmoran
  1 sibling, 1 reply; 7+ messages in thread
From: Wiljan Derks @ 2005-09-19 20:21 UTC (permalink / raw)


I did some performance testing on hyperthreaded pentium 2.8 using windows 
XP.
My conclusion was amazing:
* Hyperhreading works: it is as if things are executed in parallel
* Typically when only one thead is active, it can almost do the same amount 
of work
   compared to the situation with hyperthreading disabled.
* When two threads are active, the total amount of work done is certainly 
not more
   then having one thread is active.
* With hypertreading disabled, there seems to be more total performace.
Thus basically hyperthreading can makes a system more responsive, but not 
faster in cpu performance.
So it is not like a multiprocessor system.
I think the hypertreading is more like multiplexing the cpu over different 
threads without the
cost of taskswitching but t does not give you additional CPU performance.
This effect can make measurements very confusing.

I whould like to know if anyone has a different experience.





^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: puzzled re hyperthreaded performance
  2005-09-19 20:21 ` Wiljan Derks
@ 2005-09-20  0:26   ` tmoran
  0 siblings, 0 replies; 7+ messages in thread
From: tmoran @ 2005-09-20  0:26 UTC (permalink / raw)


> I think the hypertreading is more like multiplexing the cpu over different
> threads without the
> cost of taskswitching but t does not give you additional CPU performance.
> This effect can make measurements very confusing.
>
> I whould like to know if anyone has a different experience.
  Running a sort on 100K floats (so about 400K bytes) I find a substantial
speedup (about 30%) if it's split into two simultaneous tasks, the main
program and an Ada task.
  Using Ada.Containers.Hashed_Maps to find all the n-character strings in
a given 50K string of just 4 letters (simulated DNA) with the main
program doing it for n=12 and the extra Ada task for n=18, there is no
improvement, and in fact a loss.  But the same program run with n=6 and
n=8 does show significant speedup.  In the former case, there are roughly
50K strings of the indicated sizes, so the hash table presumably has about
50000*18 bytes of key data or roughly 1MB, so the second task, which needs
another 1MB hash table, is probably fighting over the processor's total 1MB
cache space, while in the latter case there is probably room for both in
cache.  If I switch to Ada.Containers.Indefinite_Hashed_Maps, then even
the small one shows no multitasking speedup.  I haven't analyzed
Indefinite_Hashed_Maps to see why that might be the case.
  Running a word counting program, there was a substantial speedup by
processing alternate buffer loads via the environment task and a second
task.  In that case there was probably no cache contention, and one
task was probably processing its bufferload while the other was doing
IO (which probably did not come off disk except on the first run of
the program).
  The above is on a hyperthreaded Pentium under Windows 2000. YMMV



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2005-09-20  0:26 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-09-15  5:10 puzzled re hyperthreaded performance tmoran
2005-09-15  6:08 ` jtg
2005-09-15 12:13   ` keld_nielsen_4nulspam
2005-09-15 20:44     ` tmoran
2005-09-15 21:43       ` tmoran
2005-09-19 20:21 ` Wiljan Derks
2005-09-20  0:26   ` tmoran

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox