From: KarlNyberg
Newsgroups: comp.lang.ada
Subject: Re: Multicore problem
Date: Tue, 9 Mar 2010 14:37:02 -0800 (PST)

On Mar 9, 1:48 pm, Anatoly Chernyshev wrote:
> > Perhaps the two tasks/cores are competing for data-cache space, and
> > causing more cache misses than if a single task/core is active?
> >
> > How large is your 3D array, compared to the cache?
>
> Right now it is small, 23 MB, for the purposes of algorithm honing,
> against a 2 MB L2 cache. Eventually the array could be 4-8 GB (not on
> my current machine, of course).
>
> > If this is the problem, you could try to divide the whole problem
> > into slices so that each slice processes a part of the 3D array
> > that *can* fit in the cache, but can also be processed in parallel
> > by two tasks/cores. Not knowing what your processing actually does,
> > I can't say more about how to slice the problem.
>
> Each task begins in a cell of its designated area of the array, and
> then does a Markov walk all around the place using the cell pointers.

We recently had a thread on this kind of thing on LinkedIn. Things
like cache size and communication overhead (tip of the hat to Fred
Brooks and The Mythical Man-Month here) all come into play. Here is
one of the comments:

LinkedIn Groups
Group: Multicore & Parallel Computing
Subject: New comment (7) on "Why does speedup decrease with the number
of threads?"

On a system such as an Intel x86, the CPU can execute several
instructions per clock cycle. However, it takes several cycles to read
or write the L1 cache, more cycles for the L2 cache, and still more
for the L3 cache. If you have to read from main memory, you may wait
hundreds, if not thousands, of cycles for the read to occur, and while
the CPU is waiting for a read to complete from L1/L2/L3/main memory,
it cannot execute.

Each memory reference also needs a valid TLB (translation lookaside
buffer) entry to translate its virtual address to a physical address.
I don't know what page size Windows 7 uses, but most operating systems
since the 1970s use a 4 KB page size, and you might have only 64 TLB
entries per core.
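[Aside from me, not from the comment: 64 TLB entries over 4 KB pages
cover only 64 * 4 KB = 256 KB of address space, so a random walk over
even the 23 MB test array will miss the TLB on most steps.]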
Every TLB miss costs you a system routine: there are hundreds of
instructions in the TLB miss handler, plus the supervisor context
switch.

Think of memory like a disk drive. With DRAM, first you seek to the
page in the memory device, then you access the block you want, then
you transfer that block into the CPU cache. A memory controller
serializes requests to memory between the CPU and external devices
such as the GPU, disk, etc.

Now what happens when you go from 1 to 2 to 4 to 8 cores? Instead of
one CPU making requests on memory, you have 2 or 4 or 8 CPUs making
requests, and the memory controller must queue them and service them
one at a time. So when you go from 2 cores to 4 cores, you can double
the length of the queue a read waits in. If one read takes 1000 cycles
and the queue is 4 deep, that read can now wait 4000 cycles worst
case, or 2000 on average. With 8 cores, the queue can be 8 deep when
every core issues a read at once. Don't forget that your graphics card
will be DMAing to and from DRAM as well, which adds further delay, on
top of other I/O devices and other processes running in your system
such as a virus scanner, email, ...

You can also have problems if you do not align data between cores on
cache-line boundaries. If two cores access data in the same cache line
("false sharing"), their respective caches fight over it whenever
there is a write, because only one core can own a cache line at a
time, and transferring a line between cores is expensive.

Also, when you use one core, it has the shared L2 and L3 caches to
itself. I have seen Intel parts where cores 0 and 1 share one L2 cache
and cores 2 and 3 share another, so you may get better results running
two tasks on cores 0 and 2 than on cores 0 and 1. Once core 1 starts
using the L2 cache it shares with core 0, data that core 0 needs will
be evicted by core 1 and vice versa, and both cores slow down as they
stall waiting for the cache to be refilled (thrashing).

Note that every time the operating system switches tasks, the TLB
contents for the old task must be flushed and regenerated for the new
one. That is an expensive operation if your task does little
computation between switches, and the cache may also need to be
refilled for the new address space.

When you use multiple cores, lock each task to a specific core, since
it is expensive for a task to migrate between cores. And for this sort
of computation you should be using the SIMD instructions, which
operate on 4, 8, or more data items in one instruction per core.

Hyperthreading just gives you virtual cores: it is a mechanism to keep
the physical core busy executing instructions while an L1/L2/L3/memory
transfer is in flight. If you program correctly, you should be able to
fully utilize the virtual cores as well.

Posted by Paul Burega
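Since this is comp.lang.ada, a few rough sketches of how Paul's points
map onto Ada. These are illustrations under stated assumptions, not
tested code. First, the cache-line fight: the usual fix is to pad
per-task data out to a full line. This assumes a 64-byte line and an
8-byte Long_Long_Integer, as with GNAT on x86; check your own part.

--  Sketch only: pad per-task data so no two tasks ever write the
--  same cache line.  Assumes a 64-byte line, 8-byte Long_Long_Integer.

package Padded_Counters is

   Cache_Line_Bytes : constant := 64;

   type Padded_Counter is record
      Value : Long_Long_Integer := 0;
      --  Filler brings the record up to one full cache line.
      Pad   : String (1 .. Cache_Line_Bytes - 8) := (others => ' ');
   end record;
   for Padded_Counter'Alignment use Cache_Line_Bytes;

   --  Element I never shares a line with element J, so a write by one
   --  core does not invalidate another core's cached element.
   type Per_Task_Counters is array (Positive range <>) of Padded_Counter;

end Padded_Counters;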
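Second, locking tasks to cores. The draft of the next Ada standard
proposes a CPU aspect and a System.Multiprocessors package for exactly
this; with a compiler that supports the draft it would look roughly
like the following. The core numbers are placeholders: pick them to
match your part's cache topology, per Paul's cores-0-and-2 remark.

with System.Multiprocessors;  use System.Multiprocessors;

procedure Pinned_Workers is

   --  Each worker is locked at creation to the core named by its
   --  discriminant, so the scheduler cannot migrate it.
   task type Worker (Core : CPU) with CPU => Core;

   task body Worker is
   begin
      null;  --  this core's slice of the Markov walk would go here
   end Worker;

   --  Hypothetical picks: on a part where cores 0/1 and 2/3 share L2
   --  caches pairwise, these two (1-based) choices land on different
   --  L2s.
   W1 : Worker (Core => 1);
   W2 : Worker (Core => 3);

begin
   null;  --  the environment task just awaits the workers
end Pinned_Workers;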
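And Niklas's slicing suggestion upthread, in rough terms: walk the
array one cache-sized block at a time. The sizes below are
placeholders, and a genuinely random Markov walk will not confine
itself to a block this neatly, but the loop structure is the idea.

procedure Blocked_Walk is

   --  Placeholder sizes: choose Block so Block**3 cells fit in L2
   --  (32**3 Floats is 128 KB, comfortable inside a 2 MB L2).
   N     : constant := 64;
   Block : constant := 32;

   type Grid is array (1 .. N, 1 .. N, 1 .. N) of Float;
   A : Grid := (others => (others => (others => 0.0)));

begin
   --  The three outer loops pick a block origin; the three inner
   --  loops stay inside that block, so the working set is one
   --  cache-sized chunk at a time.
   for BX in 0 .. N / Block - 1 loop
      for BY in 0 .. N / Block - 1 loop
         for BZ in 0 .. N / Block - 1 loop
            for X in 1 .. Block loop
               for Y in 1 .. Block loop
                  for Z in 1 .. Block loop
                     declare
                        Cell : Float renames A (BX * Block + X,
                                                BY * Block + Y,
                                                BZ * Block + Z);
                     begin
                        Cell := Cell + 1.0;  --  stand-in for real work
                     end;
                  end loop;
               end loop;
            end loop;
         end loop;
      end loop;
   end loop;
end Blocked_Walk;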