From: Brian Drummond
Newsgroups: comp.lang.ada
Subject: Re: Ada lacks lighterweight-than-task parallelism
Date: Wed, 20 Jun 2018 12:28:22 -0000 (UTC)

On Tue, 19 Jun 2018 15:14:16 -0700, Dan'l Miller wrote:

> http://www.theregister.co.uk/2018/06/18/microsoft_e2_edge_windows_10
>
> As discussed in the article above, Microsoft is starting to unveil its
> formerly-secret development of what could be described as “Itanium done
> right”.

wait what? ... JAN GRAY? (in the Further Reading section)

breadcrumbs to https://arxiv.org/abs/1803.06617

"Design productivity is still a challenge for reconfigurable computing.
It is expensive to port workloads into gates and to endure 10**2 to 10**4
second bitstream rebuild design iterations."

( ... no kidding, but tolerable where it brings program execution times
below 10**6 or 10**7 seconds)

so this is primarily work that emerged from the RC shadows, where for the
past quarter century, people like JG have exploited parallelism not at the
task level or even the "slice" level but at the gate level where that
helps...

and where one of the chief difficulties has been the interface between
that (unconstrained) level and the tightly constrained level (a single
operation stream from the compiler, reverse-engineered into out-of-order
superscalar execution within the CPU)

and where some other efforts to smooth the way between parallelism domains
are still ongoing...

https://www.extremetech.com/computing/269461-intel-shows-off-xeon-scalable-gold-6138p-with-an-integrated-fpga
https://www.nextplatform.com/2018/05/24/a-peek-inside-that-intel-xeon-fpga-hybrid-chip/

(I'm imagining on-chip PCIe links working like the old Transputer channels
here, but streaming data to/from the bespoke hardware engine directly,
with much less overhead than I used to have doing RC with external FPGA
boards)

... if there are Ada dimensions here, one might be compiling Ada directly
to hardware...

https://www.cs.york.ac.uk/ftpdir/papers/rtspapers/R:Ward:2001.ps

...which paper is only slightly weakened by the fact that his published
example procedure is also synthesisable VHDL! In fact Xilinx XST
synthesises that example to run in a single clock cycle...

... however, at an appallingly slow clock ... vs 732 cycles for the
paper's result and an estimated 44,000 for an 80486.

Thus the Ward paper's true merit is, ironically, that it allows automatic
extraction of a degree of sequentialism from an inherently parallel
example, opening the way to automatic generation of faster (and maybe
smaller) pipelined dataflow hardware.
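
(To make the "also synthesisable VHDL" remark concrete, here is a trivial
sketch of my own, emphatically not the procedure from the Ward paper:)

--  A hypothetical illustration (not Ward's example): the statement part
--  of Sum, between "begin" and "end", uses only constructs that carry the
--  same meaning in VHDL's sequential subset (':=', 'for ... in ...'Range
--  loop'), so it could be pasted into a VHDL procedure or process body;
--  only the declarations need VHDL spellings ("variable Acc : Integer
--  := 0;", "array (0 to 7)" rather than "array (0 .. 7)").

with Ada.Text_IO;

procedure Dual_Sum is

   type Int_Array is array (0 .. 7) of Integer;

   --  Accumulate the elements of Data into Total.
   procedure Sum (Data : in Int_Array; Total : out Integer) is
      Acc : Integer := 0;
   begin
      for I in Data'Range loop
         Acc := Acc + Data (I);
      end loop;
      Total := Acc;
   end Sum;

   Sample : constant Int_Array := (1, 2, 3, 4, 5, 6, 7, 8);
   Result : Integer;

begin
   Sum (Sample, Result);
   Ada.Text_IO.Put_Line ("Sum =" & Integer'Image (Result));
end Dual_Sum;

(Compiled with GNAT it just prints the sum; dropped into a clocked VHDL
process, a loop like that with static bounds is exactly the sort of thing
a synthesiser flattens into one long combinational path - hence "single
cycle, appallingly slow clock".)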
That extraction is actually quite difficult, and historically an
extensively manual step in past RC flows - another bottleneck in addition
to the "bitstream rebuild" times JG complains about.

So why stop at the "slice" level as EDGE does? It makes sense if there is
automatic translation (compilation at usably fast rates) from source to
that level, AND if a large and sufficiently generally useful structure of
slices can be implemented in an ASIC without the time and area penalty of
FPGA routing.

One way of looking at it is to see a "slice" as a higher-level or
larger-grained FPGA LUT (which is itself a generalisation of one or
several gates). FPGAs have been becoming coarser grained anyway, as well
as adding RAM blocks and (first multiplier, then DSP primitive) blocks by
the hundreds - because fewer, more powerful primitive blocks reduce that
routing (area + speed) penalty.

A few dozen BlockRAMs, for example, configured the right way, open the
door to letting stack architectures go superscalar (eliminating huge
problems addressing registers), though I don't know if this has ever been
exploited.

Interesting development... perhaps a logical growth from JG's involvement
with the ultra-fine-grained XC6200 FPGA, where RC pretty much started.

--
Brian