From: "Dr. Adrian Wrigley"
Subject: Re: Embedded languages based on early Ada (from "Re: Preferred OS, processor family for running embedded Ada?")
Newsgroups: comp.lang.ada,comp.lang.vhdl
References: <113ls6wugt43q$.cwaeexcj166j$.dlg@40tude.net> <1i3drcyut9aaw.isde6utlv6iq.dlg@40tude.net> <1j0a3kevqhqal.riuhe88py2tq$.dlg@40tude.net> <45E9B032.60502@obry.net>
Date: Sat, 03 Mar 2007 21:28:23 GMT

On Sat, 03 Mar 2007 18:28:18 +0100, Pascal Obry wrote:

> Dr. Adrian Wrigley wrote:
>> Numerous algorithms in simulation are "embarrassingly parallel",
>> but this fact is completely and deliberately obscured from compilers.
>
> Not a big problem. If the algorithms are "embarrassingly parallel" then
> the jobs are fully independent.
> In this case that is quite simple,

They aren't independent in terms of cache use! They may also have common
subexpressions, which independent treatments re-evaluate.

> create as many tasks as you have of processors. No big deal. Each task
> will compute a specific job. Ada has no problem with "embarrassingly
> parallel" jobs.

A problem is that it breaks the memory bandwidth budget. This approach is
tricky with large numbers of processors, and even more challenging with
hardware synthesis.

> What I have not yet understood is that people are trying to solve, in
> all cases, the parallelism at the lowest lever. Trying to parallelize an
> algorithm in an "embarrassingly parallel" context is loosing precious
> time.

You need to parallelise at the lowest level to take advantage of hardware
synthesis. For normal threads a somewhat higher level is desirable. For
multiple systems on a network, a high level is needed. What I want in a
language is the ability to specify when things must be evaluated
sequentially, and when it doesn't matter (even if the result of changing
the order may differ).

> Many real case simulations have billions of those algorithm to
> compute on multiple data, just create a set of task to compute in
> parallel multiple of those algorithm. Easier and as effective.

Reasonable for compilers and processors as they are designed now. Even
so, it can be challenging to take advantage of shared calculations under
memory capacity and bandwidth limitations. But it is useless for hardware
synthesis, for automated partitioning software, or for generating system
diagrams from code.

Manual partitioning into tasks and sequential code segments is not part
of the problem domain but part of the solution domain: it implies a
multiplicity of sequentially executing process threads. Using concurrent
statements in the source code is not the same thing as "trying to
parallelise an algorithm". It doesn't lose any precious execution time.
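The task-per-processor scheme quoted above can be sketched in a few lines. This is an illustrative Python sketch rather than Ada, and the names are invented: `compute_job` is a hypothetical stand-in for one independent simulation step, and a thread pool stands in for Ada tasks (real CPU-bound work would want a process pool or a language without a global interpreter lock):

```python
# Minimal sketch of "create as many tasks as you have processors":
# each worker computes independent jobs. A thread pool stands in for
# Ada tasks; compute_job is a hypothetical example workload.
import os
from concurrent.futures import ThreadPoolExecutor

def compute_job(i):
    # Fully independent job: the result depends only on its input.
    return i * i

def run_all(n_jobs):
    workers = os.cpu_count() or 1          # one task per processor
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compute_job, range(n_jobs)))
```

Note that this says nothing about the objections raised above: the pool schedules jobs with no knowledge of shared subexpressions, cache footprint or memory traffic.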
Concurrent statements simply inform the reader and the compiler that the
order of certain actions isn't considered relevant. The compiler can take
some parts of the source and convert them to a netlist for an ASIC or
FPGA. Other parts could be broken down into threads, or passed to
separate computer systems on a network. Much of it could be ignored. It
is the compiler which tries to parallelise the execution, unlike tasks,
where the programmer does the parallelising.

Whose job is it to parallelise operations? Traditionally, programmers try
to specify exactly what sequence of operations is to take place. Then the
compiler does its best (within limits) to shuffle things around, and the
CPU tries to overlap data fetch, calculation and address calculation by
watching the instruction sequence for concurrency opportunities. Why do
the work to force sequential operation if the compiler and hardware are
desperately trying to infer concurrency?

> In other words, what I'm saying is that in some cases ("embarrassingly
> parallel" computation is one of them) it is easier to do n computations
> in n tasks than n x (1 parallel computation in n tasks), and the overall
> performance is better.

This is definitely the case. And it helps explain why parallelisation is
not a job for the programmer or the hardware designer, but for the
synthesis tool, OS, processor, compiler or run-time.

Forcing the programmer or hardware designer to hard-code a specific
parallelism type (threads), and a particular partitioning, while denying
them the expressiveness of a concurrent language, will result in inferior
flexibility and an inability to map the problem onto certain types of
solution. If all the parallelism your hardware has is a few threads, then
all you need to code for is tasks. If you want to be able to target
FPGAs, million-thread CPUs, ASICs and loosely coupled processor networks,
the Ada task model alone serves very poorly.
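The distinction between a mandated sequence and declared order-independence can be made concrete. A hedged Python sketch (the function names are invented for illustration, not any real API): the first function has a genuine data dependence between steps, so order is essential; the second has none, so any schedule a compiler, runtime or synthesis tool picks yields the same result.

```python
# Sketch of the distinction drawn above: a sequential chain whose
# order is essential, versus order-independent evaluations that a
# compiler or runtime is free to schedule in any order (or in
# parallel). Illustrative names only.
def sequential_chain(x, steps):
    # Each step consumes the previous result: order is mandatory.
    for f in steps:
        x = f(x)
    return x

def order_independent(inputs, f):
    # No evaluation depends on another; evaluating the elements in
    # reverse (or any permutation) yields the same mapping.
    forward = [f(v) for v in inputs]
    backward = list(reversed([f(v) for v in reversed(inputs)]))
    assert forward == backward
    return forward
```

A concurrent language lets the source say which of these two situations holds, instead of forcing everything into the first form and hoping the tools can undo it.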
Perhaps mapping the execution of a program onto threads or other
concurrent structure is like mapping execution onto memory. It *is*
possible to manage a processor with a small, fast memory mapped at a
fixed address range: you use special calls to move data to and from your
main store, based on your own analysis of how the memory access patterns
will operate. But this approach has given way to automated caches with
dynamic mapping of memory cells to addresses, and to virtual memory.
Trying to manage tasks "manually", based on your hunches about task
coherence and work load, will surely give way to automatic thread
inference, creation and management based on the interaction of thread
management hardware and OS support. Building in hunches about tasking to
achieve parallelism can only be a short-term solution.
--
Adrian