From: Brian Drummond
Newsgroups: comp.lang.ada
Subject: Re: Ada lacks lighterweight-than-task parallelism
Date: Wed, 20 Jun 2018 12:28:22 -0000 (UTC)

On Tue, 19 Jun 2018 15:14:16 -0700, Dan'l Miller wrote:

> http://www.theregister.co.uk/2018/06/18/microsoft_e2_edge_windows_10
>
> As discussed in the article above, Microsoft is starting to unveil its
> formerly-secret development of what could be described as “Itanium done
> right”.

wait what? ... JAN GRAY? (in the Further Reading section)

breadcrumbs to https://arxiv.org/abs/1803.06617

"Design productivity is still a challenge for reconfigurable computing.
It is expensive to port workloads into gates and to endure 10**2 to 10**4
second bitstream rebuild design iterations."

( ... no kidding, but tolerable where it brings program execution times
below 10**6 or 10**7 seconds)

so this is primarily work that emerged from the RC shadows, where for the
past quarter century, people like JG have exploited parallelism not at the
task level or even the "slice" level but at the gate level where that
helps...

and where one of the chief difficulties has been the interface between
that (unconstrained) level and the tightly constrained level (a single
operation stream from the compiler, reverse-engineered into out-of-order
superscalar execution within the CPU)

and where some other efforts to smooth the way between parallelism domains
are still ongoing...

https://www.extremetech.com/computing/269461-intel-shows-off-xeon-scalable-gold-6138p-with-an-integrated-fpga
https://www.nextplatform.com/2018/05/24/a-peek-inside-that-intel-xeon-fpga-hybrid-chip/

(I'm imagining on-chip PCIe links working like the old Transputer channels
here, but streaming data to/from the bespoke hardware engine directly,
with much less overhead than I used to have doing RC with external FPGA
boards)

... if there are Ada dimensions here, one might be compiling Ada directly
to hardware...

https://www.cs.york.ac.uk/ftpdir/papers/rtspapers/R:Ward:2001.ps

...which paper is only slightly weakened by the fact that his published
example procedure is also synthesisable VHDL! In fact Xilinx XST
synthesises that example to run in a single clock cycle...

... however, at an appallingly slow clock ... vs 732 cycles for the
paper's result and an estimated 44,000 for an 80486.

Thus the Ward paper's true merit is, ironically, that it allows automatic
extraction of a degree of sequentialism from an inherently parallel
example, opening the way to automatic generation of faster (and maybe
smaller) pipelined dataflow hardware.
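
(To make the "also synthesisable VHDL" remark concrete, here is a trivial
sketch of my own, emphatically not the procedure from the Ward paper:)

--  A hypothetical illustration (not Ward's example): the statement part
--  of Sum, between "begin" and "end", uses only constructs that carry the
--  same meaning in VHDL's sequential subset (':=', 'for ... in ...'Range
--  loop'), so it could be pasted into a VHDL procedure or process body;
--  only the declarations need VHDL spellings ("variable Acc : Integer
--  := 0;", "array (0 to 7)" rather than "array (0 .. 7)").

with Ada.Text_IO;

procedure Dual_Sum is

   type Int_Array is array (0 .. 7) of Integer;

   --  Accumulate the elements of Data into Total.
   procedure Sum (Data : in Int_Array; Total : out Integer) is
      Acc : Integer := 0;
   begin
      for I in Data'Range loop
         Acc := Acc + Data (I);
      end loop;
      Total := Acc;
   end Sum;

   Sample : constant Int_Array := (1, 2, 3, 4, 5, 6, 7, 8);
   Result : Integer;

begin
   Sum (Sample, Result);
   Ada.Text_IO.Put_Line ("Sum =" & Integer'Image (Result));
end Dual_Sum;

(Compiled with GNAT it just prints the sum; dropped into a clocked VHDL
process, a loop like that with static bounds is exactly the sort of thing
a synthesiser flattens into one long combinational path - hence "single
cycle, appallingly slow clock".)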
That extraction is actually quite difficult, and historically an
extensively manual step in past RC flows - another bottleneck in addition
to the "bitstream rebuild" times JG complains about.

So why stop at the "slice" level as EDGE does? It makes sense if there is
automatic translation (compilation at usably fast rates) from source to
that level, AND if a large and sufficiently generally useful structure of
slices can be implemented in an ASIC without the time and area penalty of
FPGA routing.

One way of looking at it is to see a "slice" as a higher-level or
larger-grained FPGA LUT (which is itself a generalisation of one or
several gates). FPGAs have been becoming coarser grained anyway, as well
as adding RAM blocks and (first multiplier, then DSP primitive) blocks by
the hundreds - because fewer, more powerful primitive blocks reduce that
routing (area + speed) penalty.

A few dozen BlockRAMs, for example, configured the right way, open the
door to letting stack architectures go superscalar (eliminating huge
problems addressing registers), though I don't know if this has ever been
exploited.

Interesting development... perhaps a logical growth from JG's involvement
with the ultra-fine-grained XC6200 FPGA, where RC pretty much started.

--
Brian