Newsgroups: comp.lang.ada
Subject: Re: Portable memory barrier?
From: Robert Eachus
Date: Mon, 8 May 2017 21:02:44 -0700 (PDT)

On Monday, May 8, 2017 at 8:30:02 PM UTC-4, Jere wrote:

First, oops! I missed that the program computes Write_Index + 1 twice.  The
CPU won't, but the code sequence should be:

   r1 := Write_Index;
   ...
   r1 := r1 + 1;
   r2 := r1 * Element'Length;
   r3 := r2 + FIFO'First;
   r4 := Element;
   (r3) := r4;
   (Write_Index'Address) := r1;

(Parentheses mean a store to that address.)

> What prevents the last 3 lines from being reordered like this:
> r1 := r1 + 1;
> (Write_Index'Address) := r1;
> (r3) := r4;

The rule against reordering of writes.  I think your question still stands,
but that is the answer: moves to registers are reads, moves to memory are
writes.  Reads can be moved ahead of writes to other locations, but the
order of writes has to be maintained.

> I may have misunderstood your post, but isn't it free to make this
> adjustment to the instructions?  There is no dependence on r3 or r4 for
> the Write_Index'Address part or the r1 increment.  I ask because this very
> issue has (allegedly) been seen implementing lock free designs in other
> languages like C/C++.
>
> If it is, then the risk is that after
> (Write_Index'Address) := r1;
>
> a task context switch or interrupt occurs and now the index is incremented
> before the data was added (and now pointing to junk).

No.  Or at least it should not happen.  A single register set exists when
the interrupt is allowed to occur.  That state may have the (correct) value
of Write_Index in memory (or not) or in r1 (or not).  If r1 has the old or
the new value, that is not a problem.  And if the write to Write_Index
occurred, the write to FIFO must have happened first (on an x86 or AMD64 CPU
at least).  In other words, the write pipe will get emptied before the cache
gets flushed.  Note that the cache flush does not occur until after the
interrupt.*
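So much for what the x86 hardware does; the compiler also has to cooperate,
and in Ada that means marking the shared variables (RM C.6).  Here is a
minimal sketch of the writer side, with my own assumptions spelled out: the
package and names (Ring, Item, Push) are mine, the capacity is arbitrary,
and wraparound is ignored for now just as in the pseudocode above, so this
version raises Constraint_Error when the buffer fills; the circular version
comes below.

   package Ring is
      Max : constant := 16;              -- hypothetical capacity
      subtype Item is Integer;           -- hypothetical element type
      procedure Push (E : in Item);      -- called from the single writer only
   private
      FIFO : array (1 .. Max) of Item;
      pragma Volatile (FIFO);            -- element stores really go to memory
      Write_Index : Natural := 0;
      pragma Atomic (Write_Index);       -- indivisible index updates, seen in order
   end Ring;

   package body Ring is
      procedure Push (E : in Item) is
      begin
         FIFO (Write_Index + 1) := E;    -- store the data first
         Write_Index := Write_Index + 1; -- then publish the new index
      end Push;
   end Ring;

Whether C.6 makes that ordering airtight on every target is the sort of
question this thread is about, which is one more reason to prefer the
protected object sketched near the end of this post.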
On other (usually RISC) architectures, the interrupt service routine may
have to execute a "fence" instruction to force the CPU to a consistent
state.  (x86 does have three fence instructions, LFENCE, SFENCE, and MFENCE;
I found an interesting writeup showing that they are very occasionally
needed: https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/)
Fences are more often needed on RISC CPUs, since the ordering rules are more
relaxed.  That may be where the problem you noted occurred.  Especially in
embedded controllers, there can be interrupt service routines not supplied
by the OS.  If the other half of this stream is implemented in such an
interrupt routine, the fence instruction is necessary.

In general, do not use fences in Ada code; use protected operations.  (A
sketch of such a protected object appears at the end of this post.)  At a
minimum, the lock needed for a cleverly written protected object can be
replaced by a read-modify-write (RMW) instruction, and the CPU should
execute it in under a nanosecond.  Code cannot be moved past RMW operations,
and they are much faster than fences.

Why isn't a fence needed here?  Because each of the shared variables is
written by only one thread.  But the example is flawed, in that we do have
an access that needs to be protected.  To make the buffer a circular buffer
you need to add a test and adjustment.  For simplicity, change:

   FIFO (Write_Index + 1) := Element;
   Write_Index := Write_Index + 1;

to:

   Temp := Write_Index;
   if Temp = Max then
      Temp := 1;
   else
      Temp := Temp + 1;
   end if;
   FIFO (Temp) := Element;
   Write_Index := Temp;

Making Temp explicit is a belt and suspenders operation.  The CPU will
generate the same set of micro-ops, or at least the same set once we make
the buffer circular.  But now it should be clear that the two writes to
memory cannot be reordered.  Note that the writes to Temp can be replaced by
writes to r1, and those count as reads under the x86 rules.  ...and the same
for Read_Index.

Simpler?  Huh? What are we doing here?  The same thing we did in the
pseudocode, but now written out in the Ada source.

> I'm probably gonna have to reread it a few times.

I could probably teach a graduate course on this and not include anything
not touched on here.  And yes, I did teach a graduate course in compiler
construction several times, back before this revolution in how CPUs are put
together.  If I were teaching that course now, I would either limit it to
the front end and a bit of optimization, or stretch it into two semesters.
I won't say that this stuff makes quantum mechanics look easy, but they are
at the same level of complexity.  (In fact, you may have seen that Elon
Musk, among others, believes there is a good chance we are living in a
simulation.  The similarities are eerie, and the "as ifs" pop up in the same
places.)
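Coming back to the advice above about protected operations rather than
fences, here is a minimal sketch of the protected-object version, again
under my own assumptions (names Buffers, Bounded_Buffer, Put, Get; plain
blocking entries rather than the lock-free single-writer scheme discussed
above).  The point is that no fence, pragma, or ordering argument appears in
the source; the compiler and runtime supply whatever lock or RMW sequence
the target needs.

   package Buffers is
      Max : constant := 16;                     -- hypothetical capacity
      subtype Item is Integer;                  -- hypothetical element type
      type Item_Array is array (1 .. Max) of Item;

      protected Bounded_Buffer is
         entry Put (E : in Item);               -- blocks while the buffer is full
         entry Get (E : out Item);              -- blocks while the buffer is empty
      private
         FIFO        : Item_Array;
         Write_Index : Positive := 1;
         Read_Index  : Positive := 1;
         Count       : Natural  := 0;
      end Bounded_Buffer;
   end Buffers;

   package body Buffers is
      protected body Bounded_Buffer is
         entry Put (E : in Item) when Count < Max is
         begin
            FIFO (Write_Index) := E;
            Write_Index := (Write_Index mod Max) + 1;   -- circular step
            Count := Count + 1;
         end Put;

         entry Get (E : out Item) when Count > 0 is
         begin
            E := FIFO (Read_Index);
            Read_Index := (Read_Index mod Max) + 1;     -- circular step
            Count := Count - 1;
         end Get;
      end Bounded_Buffer;
   end Buffers;

All the indexes live inside the protected object, so none of the reordering
questions above arise at the application level; they become the compiler
writer's problem, which is where they belong.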
* Especially if you are running on an AMD CPU in a virtual processor.  As
long as the CPU stays in the hypervisor state, the cache doesn't need to be
flushed.  That's a detail, and a huge bite of extra complexity here; it
allows the hypervisor to service a page fault without flushing the cache.
Of course, if there is a context switch, even to the OS on the virtual
machine, the cache flush is needed.  Usually, if the page is not present in
memory, the cache flush (and a context switch) happens, then the hypervisor
turns control over to the "real" OS.  (Disks, even SSDs, are slow.)
Otherwise the hypervisor patches up the page tables and lets the (faulting)
instruction be retried, never touching the OS.