Newsgroups: comp.lang.ada
Subject: Re: Portable memory barrier?
From: Robert Eachus
Date: Mon, 8 May 2017 21:02:44 -0700 (PDT)

On Monday, May 8, 2017 at 8:30:02 PM UTC-4, Jere wrote:

First, oops! I missed that the program computes Write_Index + 1 twice.  The
CPU won't, but the code sequence should be:

   r1 := Write_Index;
   ...
   r1 := r1 + 1;
   r2 := r1 * Element'Length;
   r3 := r2 + FIFO'First;
   r4 := Element;
   (r3) := r4;
   (Write_Index'Address) := r1;

(Parentheses mean a store to that address.)

> What prevents the last 3 lines from being reordered like this:
> r1 := r1 + 1;
> (Write_Index'Address) := r1;
> (r3) := r4;

The rule against reordering of writes.  I think your question still stands,
but that is the answer: moves to registers are reads, moves to memory are
writes.  Reads can be moved ahead of writes to other locations, but the
order of writes has to be maintained.

> I may have misunderstood your post, but isn't it free to make this
> adjustment to the instructions?  There is no dependence on r3 or r4 for
> the Write_Index'Address part or the r1 increment.  I ask because this very
> issue has (allegedly) been seen implementing lock free designs in other
> languages like C/C++.
>
> If it is, then the risk is that after
> (Write_Index'Address) := r1;
>
> a task context switch or interrupt occurs and now the index is incremented
> before the data was added (and now pointing to junk).

No.  Or at least it should not happen.  A single register set exists when
the interrupt is allowed to occur.  That state may have the (correct) value
of Write_Index in memory (or not) or in r1 (or not).  If r1 has the old or
the new value, that is not a problem.  And if the write to Write_Index
occurred, the write to FIFO must have happened first (on an x86 or AMD64 CPU
at least).  In other words, the write pipe will get emptied before the cache
gets flushed.  Note that the cache flush does not occur until after the
interrupt.*
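So much for what the x86 hardware does; the compiler also has to cooperate,
and in Ada that means marking the shared variables (RM C.6).  Here is a
minimal sketch of the writer side, with my own assumptions spelled out: the
package and names (Ring, Item, Push) are mine, the capacity is arbitrary,
and wraparound is ignored for now just as in the pseudocode above, so this
version raises Constraint_Error when the buffer fills; the circular version
comes below.

   package Ring is
      Max : constant := 16;              -- hypothetical capacity
      subtype Item is Integer;           -- hypothetical element type
      procedure Push (E : in Item);      -- called from the single writer only
   private
      FIFO : array (1 .. Max) of Item;
      pragma Volatile (FIFO);            -- element stores really go to memory
      Write_Index : Natural := 0;
      pragma Atomic (Write_Index);       -- indivisible index updates, seen in order
   end Ring;

   package body Ring is
      procedure Push (E : in Item) is
      begin
         FIFO (Write_Index + 1) := E;    -- store the data first
         Write_Index := Write_Index + 1; -- then publish the new index
      end Push;
   end Ring;

Whether C.6 makes that ordering airtight on every target is the sort of
question this thread is about, which is one more reason to prefer the
protected object sketched near the end of this post.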
On other (usually RISC) architectures, the interrupt service routine may
have to execute a "fence" instruction to force the CPU to a consistent
state.  (x86 does have three fence instructions, LFENCE, SFENCE, and MFENCE;
I found an interesting writeup showing that they are very occasionally
needed: https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/)
Fences are more often needed on RISC CPUs, since the ordering rules are more
relaxed.  That may be where the problem you noted occurred.  Especially in
embedded controllers, there can be interrupt service routines not supplied
by the OS.  If the other half of this stream is implemented in such an
interrupt routine, the fence instruction is necessary.

In general, do not use fences in Ada code; use protected operations.  (A
sketch of such a protected object appears at the end of this post.)  At a
minimum, the lock needed for a cleverly written protected object can be
replaced by a read-modify-write (RMW) instruction, and the CPU should
execute it in under a nanosecond.  Code cannot be moved past RMW operations,
and they are much faster than fences.

Why isn't a fence needed here?  Because each of the shared variables is
written by only one thread.  But the example is flawed, in that we do have
an access that needs to be protected.  To make the buffer a circular buffer
you need to add a test and adjustment.  For simplicity, change:

   FIFO (Write_Index + 1) := Element;
   Write_Index := Write_Index + 1;

to:

   Temp := Write_Index;
   if Temp = Max then
      Temp := 1;
   else
      Temp := Temp + 1;
   end if;
   FIFO (Temp) := Element;
   Write_Index := Temp;

Making Temp explicit is a belt and suspenders operation.  The CPU will
generate the same set of micro-ops, or at least the same set once we make
the buffer circular.  But now it should be clear that the two writes to
memory cannot be reordered.  Note that the writes to Temp can be replaced by
writes to r1, and those count as reads under the x86 rules.  ...and the same
for Read_Index.

Simpler?  Huh? What are we doing here?  The same thing we did in the
pseudocode, but now written out in the Ada source.

> I'm probably gonna have to reread it a few times.

I could probably teach a graduate course on this and not include anything
not touched on here.  And yes, I did teach a graduate course in compiler
construction several times, back before this revolution in how CPUs are put
together.  If I were teaching that course now, I would either limit it to
the front end and a bit of optimization, or stretch it into two semesters.
I won't say that this stuff makes quantum mechanics look easy, but they are
at the same level of complexity.  (In fact, you may have seen that Elon
Musk, among others, believes there is a good chance we are living in a
simulation.  The similarities are eerie, and the "as ifs" pop up in the same
places.)
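Coming back to the advice above about protected operations rather than
fences, here is a minimal sketch of the protected-object version, again
under my own assumptions (names Buffers, Bounded_Buffer, Put, Get; plain
blocking entries rather than the lock-free single-writer scheme discussed
above).  The point is that no fence, pragma, or ordering argument appears in
the source; the compiler and runtime supply whatever lock or RMW sequence
the target needs.

   package Buffers is
      Max : constant := 16;                     -- hypothetical capacity
      subtype Item is Integer;                  -- hypothetical element type
      type Item_Array is array (1 .. Max) of Item;

      protected Bounded_Buffer is
         entry Put (E : in Item);               -- blocks while the buffer is full
         entry Get (E : out Item);              -- blocks while the buffer is empty
      private
         FIFO        : Item_Array;
         Write_Index : Positive := 1;
         Read_Index  : Positive := 1;
         Count       : Natural  := 0;
      end Bounded_Buffer;
   end Buffers;

   package body Buffers is
      protected body Bounded_Buffer is
         entry Put (E : in Item) when Count < Max is
         begin
            FIFO (Write_Index) := E;
            Write_Index := (Write_Index mod Max) + 1;   -- circular step
            Count := Count + 1;
         end Put;

         entry Get (E : out Item) when Count > 0 is
         begin
            E := FIFO (Read_Index);
            Read_Index := (Read_Index mod Max) + 1;     -- circular step
            Count := Count - 1;
         end Get;
      end Bounded_Buffer;
   end Buffers;

All the indexes live inside the protected object, so none of the reordering
questions above arise at the application level; they become the compiler
writer's problem, which is where they belong.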
* Especially if you are running on an AMD CPU in a virtual processor.  As
long as the CPU stays in the hypervisor state, the cache doesn't need to be
flushed.  That's a detail, and a huge bite of extra complexity here; it
allows the hypervisor to service a page fault without flushing the cache.
Of course, if there is a context switch, even to the OS on the virtual
machine, the cache flush is needed.  Usually, if the page is not present in
memory, the cache flush (and a context switch) happens, then the hypervisor
turns control over to the "real" OS.  (Disks, even SSDs, are slow.)
Otherwise the hypervisor patches up the page tables and lets the (faulting)
instruction be retried, never touching the OS.