comp.lang.ada
* Multicore problem
@ 2010-03-09 17:48 Anatoly Chernyshev
  2010-03-09 18:10 ` Niklas Holsti
  0 siblings, 1 reply; 8+ messages in thread
From: Anatoly Chernyshev @ 2010-03-09 17:48 UTC (permalink / raw)


Hello, everyone,

Here is the problem: a large 3D array, which can be processed
in parallel. Naturally, I tried to make use of the 2 cores on my
processor by splitting the whole calculation into several parallel
tasks. Strangely, I can see that the program now utilizes both
cores (though not at 100%, as I had hoped), but the computation time
more than doubles when 2 tasks are running on 2 cores. During the
calculation, any 2 or more tasks can occasionally modify the same
array element, for which a protected type is introduced.

Here are results of benchmarking for my 2 core system (Intel P4 Dual
Core, Win XP SP3):

1 task (50% total proc. load): 680 sec.
2 tasks (70-93% load): 1520 sec.
4 tasks (70-93% load): 1195 sec.

Any ideas on why it might happen and how to get around it?

Thank you.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Multicore problem
  2010-03-09 17:48 Multicore problem Anatoly Chernyshev
@ 2010-03-09 18:10 ` Niklas Holsti
  2010-03-09 18:48   ` Anatoly Chernyshev
  0 siblings, 1 reply; 8+ messages in thread
From: Niklas Holsti @ 2010-03-09 18:10 UTC (permalink / raw)


Anatoly Chernyshev wrote:
> Hello, everyone,
> 
> Here is the following problem: large 3D array, which can be processed
> in parallel. Naturally, I was trying to make use of 2 cores on my
> processor by splitting the whole calculation into several parallel
> tasks. Strangely, I could actually see that the program utilizes both
> cores now (even not by 100% as I wished), but the computation time
> increases more than 2 times when there are 2 tasks are running on 2
> cores. During calculation any 2 or more tasks can occasionally modify
> the same array member, for which a protected type is introduced.
> 
> Here are results of benchmarking for my 2 core system (Intel P4 Dual
> Core, Win XP SP3):
> 
> 1 task (50% total proc. load): 680 sec.
> 2 tasks (70-93% load): 1520 sec.
> 4 tasks (70-93% load): 1195 sec.
> 
> Any ideas on why it might happen and how to get around it?

Perhaps the two tasks/cores are competing for data-cache space, and 
causing more cache misses than if a single task/core is active?

How large is your 3D array, compared to the cache?

If this is the problem, you could try to divide the whole problem into 
slices, so that each slice processes a part of the 3D array that *can* 
fit in the cache but can still be processed in parallel by two 
tasks/cores. Not knowing what your processing actually does, I can't say 
more about how to slice the problem.

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .




* Re: Multicore problem
  2010-03-09 18:10 ` Niklas Holsti
@ 2010-03-09 18:48   ` Anatoly Chernyshev
  2010-03-09 22:37     ` KarlNyberg
  2010-03-10  0:08     ` jonathan
  0 siblings, 2 replies; 8+ messages in thread
From: Anatoly Chernyshev @ 2010-03-09 18:48 UTC (permalink / raw)


> Perhaps the two tasks/cores are competing for data-cache space, and
> causing more cache misses than if a single task/core is active?
>
> How large is your 3D array, compared to the cache?

Right now it is small, 23 MB, for the purpose of honing the algorithm,
against a 2 MB L2 cache. Eventually the array could be 4-8 GB (not for
my current machine, of course).

> If this is the problem, you could try to divide the whole problem into
> slices so that each slice processes a part of the 3D array that *can*
> fit in the cache, but can also be processed in parallel by two
> tasks/cores. Not knowing what your processing actually does I can't say
> more about how to slice the problem.

Each task begins in a cell of its designated area of the array, and then
does a Markov walk all around the place using the cell pointers.




* Re: Multicore problem
  2010-03-09 18:48   ` Anatoly Chernyshev
@ 2010-03-09 22:37     ` KarlNyberg
  2010-03-10  0:08     ` jonathan
  1 sibling, 0 replies; 8+ messages in thread
From: KarlNyberg @ 2010-03-09 22:37 UTC (permalink / raw)


On Mar 9, 1:48 pm, Anatoly Chernyshev <achernys...@gmail.com> wrote:
> > Perhaps the two tasks/cores are competing for data-cache space, and
> > causing more cache misses than if a single task/core is active?
>
> > How large is your 3D array, compared to the cache?
>
> Right now it is small, 23 Mb, for the purposes of algorithm honing.
> 2Mb L2 cache. In perspective the array could be 4-8 Gb (not for my
> current machine, of course).
>
> > If this is the problem, you could try to divide the whole problem into
> > slices so that each slice processes a part of the 3D array that *can*
> > fit in the cache, but can also be processed in parallel by two
> > tasks/cores. Not knowing what your processing actually does I can't say
> > more about how to slice the problem.
>
> Each task begins in the cell of its designated area of array, and then
> does a Markov walk all around the place using the cell pointers.

We recently had a thread on this kind of thing on LinkedIn.  Things
like cache size and communication overhead (tip of the hat to Fred
Brooks and The Mythical Man-Month here) all come into play.

LinkedIn Groups
Group: Multicore & Parallel Computing
Subject: New comment (7) on "Why does speedup decrease with the number
of threads?"
In a system such as an Intel x86, you can execute several instructions
per clock cycle. However, it takes several cycles to read or write
data in the L1 cache, more cycles to read/write the L2 cache, and even
more cycles to read/write the L3 cache. If you have to read from memory,
you might have to wait hundreds if not thousands of cycles for the read
to occur. While the CPU is waiting for the read to complete from
L1/L2/L3/main memory, the CPU cannot execute. Also, each memory
reference must have a valid TLB (translation lookaside buffer) entry to
translate the virtual address to a physical address. I don't know what
Windows 7 uses for a page size, but most operating systems since the
1970s use a 4 KB page size, and you might have 64 TLB entries per core.
You need to execute a system routine every time you get a TLB miss, and
there are hundreds of instructions in the TLB miss handler, as well as
the supervisor context switch.

Think of memory like a disk drive. With DRAM, first you have to seek
to the page in the memory device, then you need to access the block
that you want, then you transfer that block into the CPU cache. There
is a memory controller that serializes requests to memory between the
CPU and external devices such as the GPU, disk, etc.

Now what happens when you go from 1 to 2 to 4 to 8 cores? Well,
instead of 1 cpu making requests on memory, you have 2 or 4 or 8 CPUs
making requests on memory. The memory controller needs to queue these
requests and operate on them one at a time. So when you go from 2
cores to 4 cores, you can double the size of the queue waiting for a
memory read to occur. If 1 read takes 1000 cycles, and you now have a
queue 4 deep, that 1 read can now wait for 4000 cycles worst case, or
2000 on average. With 8 cores, you can now have a queue 8 deep when
each core issues a memory read request. Don't forget that your
graphics card will be DMAing to and from DRAM as well, which can
cause memory delays, and you may have other I/O devices and other
processes running in your system, such as a virus scanner, email,
and so on.

You can also have issues if you do not align the data between cores on
cache-line boundaries. If 2 cores try to access data in the same
cache line, their respective caches fight over the data when there
is a write request, since only 1 core can 'own' a cache line, and it is
expensive to transfer a cache line between cores.

Also, when you use 1 core, it has full access to the shared L2 and
shared L3 cache. I have seen Intel architectures where cores 0 and 1
share an L2 cache, and cores 2 and 3 share another L2 cache. So in
fact, you may get better results using cores 0 and 2 when running 2
cores, instead of using cores 0 and 1. When you use just core 0, it
has use of the full shared L2 cache. When you start using core 1, this
core will also start using the L2 cache that is shared with core 0.
This means that data that core 0 needs will be evicted by core 1 and
vice versa, thus causing both cores to slow down as the cores stall
waiting for the cache to be refilled (thrashing).

Note that every time the operating system does a task switch, the
contents of the TLBs must be flushed from the old task context, and
regenerated for the new task. This is an expensive operation if you
are not doing much computation in your task. Also, the cache may need
to be refilled based on the new task address space.

When you go to use multiple cores, you should lock a specific task to
a specific core, as it is expensive for a process to move between
cores.

Also, when doing this sort of computation, you should be making use of
the SIMD instruction set, where you can operate on 4 or 8 or more data
items in 1 instruction per core.

Hyperthreading just means you have virtual cores. It is a mechanism to
keep the physical core busy executing instructions while an L1/L2/L3/
memory transfer occurs. If you program correctly, you should be able
to fully utilize the power of the virtual cores.

Posted by Paul Burega





* Re: Multicore problem
  2010-03-09 18:48   ` Anatoly Chernyshev
  2010-03-09 22:37     ` KarlNyberg
@ 2010-03-10  0:08     ` jonathan
  2010-03-10  7:19       ` Anatoly Chernyshev
  1 sibling, 1 reply; 8+ messages in thread
From: jonathan @ 2010-03-10  0:08 UTC (permalink / raw)


On Mar 9, 6:48 pm, Anatoly Chernyshev <achernys...@gmail.com> wrote:
> > Perhaps the two tasks/cores are competing for data-cache space, and
> > causing more cache misses than if a single task/core is active?
>
> > How large is your 3D array, compared to the cache?
>
> Right now it is small, 23 Mb, for the purposes of algorithm honing.
> 2Mb L2 cache. In perspective the array could be 4-8 Gb (not for my
> current machine, of course).
>
> > If this is the problem, you could try to divide the whole problem into
> > slices so that each slice processes a part of the 3D array that *can*
> > fit in the cache, but can also be processed in parallel by two
> > tasks/cores. Not knowing what your processing actually does I can't say
> > more about how to slice the problem.
>
> Each task begins in the cell of its designated area of array, and then
> does a Markov walk all around the place using the cell pointers.

Of course I can't say much about your specific problem.  I can tell
you (the little) I learned battling something similar (using Linux
and GNAT).  The problem I was up against was latency: the minimum
time to send data between 2 cores on the same processor, no matter
how small the data.  I measured this time as about 40 microseconds
(for an exchange of data between cores).  It might have been 20 or
60, but think tens of microseconds.  If message passing (or sharing
data in an array) is more frequent than this, you'll never see a
speedup on the several cores.  (Some expensive networks advertise
latencies of a few microseconds, and some experimental operating
systems advertise sub-microsecond latencies.)  So you should not
think of passing data more frequently than, say, every 200 or 400
microseconds if you want to see a speedup.  Which is a long
time ... you can do tens of thousands of floating point
operations in that time!

Jonathan




* Re: Multicore problem
  2010-03-10  0:08     ` jonathan
@ 2010-03-10  7:19       ` Anatoly Chernyshev
  2010-03-10 18:00         ` Charmed Snark
  2010-03-10 19:24         ` tmoran
  0 siblings, 2 replies; 8+ messages in thread
From: Anatoly Chernyshev @ 2010-03-10  7:19 UTC (permalink / raw)


Thanks a lot to all who responded. What is clear now is that the
problem is rather complicated. I guess I'd be better off distributing
the calculation over several PCs, with one task on each.




* Re: Multicore problem
  2010-03-10  7:19       ` Anatoly Chernyshev
@ 2010-03-10 18:00         ` Charmed Snark
  2010-03-10 19:24         ` tmoran
  1 sibling, 0 replies; 8+ messages in thread
From: Charmed Snark @ 2010-03-10 18:00 UTC (permalink / raw)


Anatoly Chernyshev expounded in news:095dc775-e214-4ccd-bf26-45ab27b3b277
@s36g2000prh.googlegroups.com:

> Thanks a lot to all responded. What is clear now is that the problem
> is rather complicated. I guess, I'd better be off with distributing
> the calculation over several PCs, with one task on each.

I was wondering about these issues in connection with
pthreads. I have a basic interpreter project
(https://sourceforge.net/projects/bdbbasic) where I
was considering implementing matrix operations in 
multiple threads. 

The net wisdom suggested that the thread startup 
cost was around 100 microseconds as a general rule. 
However, that appears to be only the beginning of 
the important considerations. ;-)

From the related post here, it seems that the matrix 
might need to be huge to be of benefit. But even then,
as the cores and cache interact, this might not improve
performance.

Interesting stuff indeed. And when you add multiple
platforms to this, a project designer really gets
his hands dirty.

Warren




* Re: Multicore problem
  2010-03-10  7:19       ` Anatoly Chernyshev
  2010-03-10 18:00         ` Charmed Snark
@ 2010-03-10 19:24         ` tmoran
  1 sibling, 0 replies; 8+ messages in thread
From: tmoran @ 2010-03-10 19:24 UTC (permalink / raw)


>Thanks a lot to all responded. What is clear now is that the problem
>is rather complicated.
  In the early days of virtual memory, one of our users (at U of
Wisconsin) decided to make use of the large memory by declaring large
arrays, which his program accessed by running down columns.
Unfortunately, the compiler allocated them by rows, so his program's
thrashing brought the whole system to a crawl.  I imagine there will
have to be a similar learning process for using multicores well.



