From: Jere
Newsgroups: comp.lang.ada
Subject: Re: Portable memory barrier?
Date: Mon, 8 May 2017 17:30:01 -0700 (PDT)
Message-ID: <8a968aae-79e4-421b-ba4b-e0a9a33ce0db@googlegroups.com>

On Monday, May 8, 2017 at 7:24:42 PM UTC-4, Robert Eachus wrote:
> On Monday, May 8, 2017 at 5:09:45 PM UTC-4, Dmitry A.
> Kazakov wrote:
>
> > This strategy does not prevent reordering writes:
> >
> > FIFO (Write_Index + 1) := Element;
> > Write_Index := Write_Index + 1;
> >
> > into
> >
> > Write_Index := Write_Index + 1;
> > FIFO (Write_Index) := Element;
> >
> > which would be a bug.
>
> You are UNDERthinking it. There is a situation which can occur, and programmers should be wary of it. (Skip to before the footnotes if you don't care about details.)
>
> What happens is that the code is broken into:
>
> r1 := Write_Index;
> ...
> r2 := r1 x Element'Length;
> r3 := r2 + FIFO'First;
> r4 := Element;
> (r3) := r4;
> r1 := r1 + 1;
> (Write_Index'Address) := r1;
>
> (Parentheses write to the address.)
>
> The compiler will play some games, and certainly will leave Write_Index in a register if possible. But with each of these assignments, there is a context consisting of its set of registers. So the two assignments to r1 go to different (renaming) registers. We don't care what the CPU calls them, as long as it keeps track of which r1 belongs to which context.
>
> Now the CPU can reorder to its heart's content, as long as it doesn't try to use a value before it has been computed.* Old values? No problem, they are available in renamed registers. The value of Write_Index (in r1) when the write to FIFO completes? Who cares, it was used to compute the correct address earlier.
>
> How does the CPU know when a value in a register is no longer needed? When all of the contexts containing that value of the register have been used. Count every use of the value until the register is changed, and, of course, reduce the count on use. The counts are not unbounded because you only count for current residents of the reorder buffers (ROBs) as micro-ops are parsed and loaded in. What if you need more registers than you have in the renaming list? Stall instruction parsing until one is freed up. Almost never happens.
> Even though the AMD64 ISA has 16 integer register names** that can appear in a program, CPUs have about a hundred renaming registers per core. Recent architectures such as Intel's Broadwell and AMD's Zen have 168-entry integer register files. (This means 1344 total in an eight-core chip. Note that the renaming registers are shared between two threads per core. There are only 160 floating-point renaming registers in Zen--1280 total. This is eight per core shy of Intel's Skylake and Kaby Lake.)
>
> So what can happen that is a potential problem? All those renaming registers mean that the most recently written value of Write_Index can trail writes to the FIFO queue. Normally this will only happen when you inline code, and there are multiple instances of the write routine inlined into a single (callable) procedure or function. Now you can have several values in the FIFO which will not yet be readable. As long as you are not worried about (nanosecond) latency, not an issue. Worst case, the same thing happens on the read side, and the writing side needs to wait. Being a former real-time programmer, you don't want the writes getting thrown on the floor.
>
> You can have the same problem if you set the write side at a higher priority than the read side on a single-core CPU with no multithreading. In either case, the solution is the same: make the FIFO large enough that such collisions (between the two indexes) don't happen. This is the sort of issue that the proof-of-correctness folk don't see. You can (and should) test for this in any hard real-time system, no matter how much theorem proving you do. Assuming you do it right, now you have average and maximum queuing latencies to feed back into your theorem prover. I'm not against theorem proving, I just know from experience that analysis and testing are also needed. ;-)
>
> * Technically, use a value before the pipe computing it has it ready for use.
> Due to the nature of pipelined architectures, it matters not if that value is not ready until stage 10, if the instruction that needs it won't need it until (its) stage 9. The CPU sees that those two instructions cannot be dispatched on the same clock cycle, but otherwise the "store forwarding" makes it look like it only took one clock cycle to compute. If you--and the CPU--are clever, you can even have the two micro-ops in the same pipe, one after the other. If you are even more clever, you can choose micro-ops such that they can execute on the same clock cycle.
>
> It is not that difficult to write the innermost loops of some programs to take only one clock cycle on average per iteration. You have three or more instructions executing together, as parts of the loop for iterations n, n-1, and n-2. So any iteration of the loop takes two clock cycles plus the pipeline depth, but the average iteration takes very close to one clock cycle.
>
> ** Well, 64-bit integer registers or parts of one, plus some other registers that it makes sense to treat as just another integer register. Floating-point values have different register names, and their own large renaming register set.

I am still digesting some of this, so forgive the simple question: what prevents the last three lines from being reordered like this?

r1 := r1 + 1;
(Write_Index'Address) := r1;
(r3) := r4;

I may have misunderstood your post, but isn't the CPU free to make this adjustment to the instructions? There is no dependence on r3 or r4 for the Write_Index'Address part or the r1 increment. I ask because this very issue has (allegedly) been seen when implementing lock-free designs in other languages like C/C++.

If it is, then the risk is that after

(Write_Index'Address) := r1;

a task context switch or interrupt occurs, and now the index is incremented before the data was added (and is now pointing to junk).

By the way, thank you for the very detailed explanation.
I can't say that I fully understand it all yet (as probably indicated by my question). I'm probably going to have to reread it a few times.