Newsgroups: comp.lang.ada
Subject: Re: Portable memory barrier?
From: Robert Eachus
Date: Mon, 8 May 2017 16:24:41 -0700 (PDT)
Message-ID: <077e7f6a-5a7b-4b88-a16f-7672aec18a17@googlegroups.com>
References: <0fc56bf7-1cfa-4776-9c47-a573db315c5f@googlegroups.com> <7b0c08eb-be62-4d14-ae99-cad038ad0a62@googlegroups.com>

On Monday, May 8, 2017 at 5:09:45 PM UTC-4, Dmitry A. Kazakov wrote:
> This strategy does not prevent reordering writes:
>
>    FIFO (Write_Index + 1) := Element;
>    Write_Index := Write_Index + 1;
>
> into
>
>    Write_Index := Write_Index + 1;
>    FIFO (Write_Index) := Element;
>
> which would be a bug.
You are UNDERthinking it. There is a situation which can occur, and programmers should be wary of it. (Skip to just before the footnotes if you don't care about the details.) What happens is that the code is broken into:

   r1 := Write_Index;
   ...
   r2 := r1 * Element'Length;
   r3 := r2 + FIFO'First;
   r4 := Element;
   (r3) := r4;
   r1 := r1 + 1;
   (Write_Index'Address) := r1;

(Parentheses denote a write to the address.)

The compiler will play some games, and will certainly leave Write_Index in a register if possible. But each of these assignments carries a context consisting of its set of registers. So the two assignments to r1 go to different (renaming) registers. We don't care what the CPU calls them, as long as it keeps track of which r1 belongs to which context.

Now the CPU can reorder to its heart's content, as long as it doesn't try to use a value before it has been computed.* Old values? No problem, they are still available in renamed registers. The value of Write_Index (in r1) when the write to FIFO completes? Who cares, it was used to compute the correct address earlier.

How does the CPU know when a value in a register is no longer needed? When all of the contexts containing that value of the register have been used. Count every use of the value until the register is changed, and, of course, reduce the count on each use. The counts are not unbounded, because you only count for the current residents of the reorder buffer (ROB) as micro-ops are parsed and loaded in. What if you need more registers than you have in the renaming list? Stall instruction parsing until one is freed up. That almost never happens. Even though the AMD64 ISA has 16 integer register names** that can appear in a program, CPUs have about a hundred renaming registers per core. Recent architectures, Intel's Broadwell and AMD's Zen, have 168-entry integer register files. (This means 1344 total in an eight-core chip.
Note that the renaming registers are shared between the two threads per core. There are only 160 floating-point renaming registers in Zen--1280 total. This is eight per core shy of Intel's Skylake and Kaby Lake.)

So what can happen that is a potential problem? All those renaming registers mean that the most recently written value of Write_Index can trail the writes to the FIFO queue. Normally this will only happen when you inline code, and there are multiple instances of the write routine inlined into a single (callable) procedure or function. Now you can have several values in the FIFO which are not yet readable. As long as you are not worried about (nanosecond) latency, this is not an issue. Worst case, the same thing happens on the read side, and the writing side needs to wait. Being a former real-time programmer, I know you don't want the writes getting thrown on the floor.

You can have the same problem if you set the write side at a higher priority than the read side on a single-core, non-multithreaded CPU. In either case, the solution is the same: make the FIFO large enough that such collisions (between the two indexes) don't happen. This is the sort of issue that the proof-of-correctness folk don't see. You can (and should) test for this in any hard real-time system, no matter how much theorem proving you do. Assuming you do it right, you now have average and maximum queuing latencies to feed back into your theorem prover. I'm not against theorem proving, I just know from experience that analysis and testing are also needed. ;-)

* Technically, use a value before the pipe computing it has it ready for use. Due to the nature of pipelined architectures, it matters not if that value is not ready until stage 10, if the instruction that needs it won't need it until (its) stage 9.
The CPU sees that those two instructions cannot be dispatched on the same clock cycle, but otherwise the "store forwarding" makes it look like the value took only one clock cycle to compute. If you--and the CPU--are clever, you can even have the two micro-ops in the same pipe, one after the other. If you are even more clever, you can choose micro-ops such that they can execute on the same clock cycle.

It is not that difficult to write the innermost loops of some programs to take only one clock cycle on average per iteration. You have three or more instructions executing together, as parts of the loop for iterations n, n-1, and n-2. So any one pass through the loop takes two clock cycles plus the pipeline depth, but the average iteration takes very close to one clock cycle.

** Well, 64-bit integer registers or parts of one, plus some other registers that it makes sense to treat as just another integer register. Floating-point values have different register names, and their own large renaming register set.