Newsgroups: comp.lang.ada
Subject: Re: Portable memory barrier?
From: Robert Eachus
Date: Mon, 8 May 2017 16:24:41 -0700 (PDT)
Message-ID: <077e7f6a-5a7b-4b88-a16f-7672aec18a17@googlegroups.com>
References: <0fc56bf7-1cfa-4776-9c47-a573db315c5f@googlegroups.com> <7b0c08eb-be62-4d14-ae99-cad038ad0a62@googlegroups.com>

On Monday, May 8, 2017 at 5:09:45 PM UTC-4, Dmitry A. Kazakov wrote:
> This strategy does not prevent reordering writes:
>
>    FIFO (Write_Index + 1) := Element;
>    Write_Index := Write_Index + 1;
>
> into
>
>    Write_Index := Write_Index + 1;
>    FIFO (Write_Index) := Element;
>
> which would be a bug.
You are UNDERthinking it. There is a situation which can occur, and programmers should be wary of it. (Skip to just before the footnotes if you don't care about the details.) What happens is that the code is broken into:

   r1 := Write_Index;
   ...
   r2 := r1 * Element'Length;
   r3 := r2 + FIFO'First;
   r4 := Element;
   (r3) := r4;
   r1 := r1 + 1;
   (Write_Index'Address) := r1;

(Parentheses denote a write to the address.)

The compiler will play some games, and will certainly leave Write_Index in a register if possible. But each of these assignments carries a context consisting of its set of registers. So the two assignments to r1 go to different (renaming) registers. We don't care what the CPU calls them, as long as it keeps track of which r1 belongs to which context.

Now the CPU can reorder to its heart's content, as long as it doesn't try to use a value before it has been computed.* Old values? No problem, they are still available in renamed registers. The value of Write_Index (in r1) when the write to FIFO completes? Who cares, it was used to compute the correct address earlier.

How does the CPU know when a value in a register is no longer needed? When all of the contexts containing that value of the register have been used. Count every use of the value until the register is changed, and, of course, reduce the count on each use. The counts are not unbounded, because you only count for the current residents of the reorder buffer (ROB) as micro-ops are parsed and loaded in. What if you need more registers than you have in the renaming list? Stall instruction parsing until one is freed up. That almost never happens. Even though the AMD64 ISA has 16 integer register names** that can appear in a program, CPUs have about a hundred renaming registers per core. Recent architectures, Intel's Broadwell and AMD's Zen, have 168-entry integer register files. (This means 1344 total in an eight-core chip.
Note that the renaming registers are shared between the two threads per core. There are only 160 floating-point renaming registers in Zen--1280 total. This is eight per core shy of Intel's Skylake and Kaby Lake.)

So what can happen that is a potential problem? All those renaming registers mean that the most recently written value of Write_Index can trail the writes to the FIFO queue. Normally this will only happen when you inline code, and there are multiple instances of the write routine inlined into a single (callable) procedure or function. Now you can have several values in the FIFO which are not yet readable. As long as you are not worried about (nanosecond) latency, this is not an issue. Worst case, the same thing happens on the read side, and the writing side needs to wait. Being a former real-time programmer, I know you don't want the writes getting thrown on the floor.

You can have the same problem if you set the write side at a higher priority than the read side on a single-core, non-multithreaded CPU. In either case, the solution is the same: make the FIFO large enough that such collisions (between the two indexes) don't happen. This is the sort of issue that the proof-of-correctness folk don't see. You can (and should) test for this in any hard real-time system, no matter how much theorem proving you do. Assuming you do it right, you now have average and maximum queuing latencies to feed back into your theorem prover. I'm not against theorem proving, I just know from experience that analysis and testing are also needed. ;-)

* Technically, use a value before the pipe computing it has it ready for use. Due to the nature of pipelined architectures, it matters not if that value is not ready until stage 10, if the instruction that needs it won't need it until (its) stage 9.
The CPU sees that those two instructions cannot be dispatched on the same clock cycle, but otherwise the "store forwarding" makes it look like the value took only one clock cycle to compute. If you--and the CPU--are clever, you can even have the two micro-ops in the same pipe, one after the other. If you are even more clever, you can choose micro-ops such that they can execute on the same clock cycle.

It is not that difficult to write the innermost loops of some programs to take only one clock cycle on average per iteration. You have three or more instructions executing together, as parts of the loop for iterations n, n-1, and n-2. So any one pass through the loop takes two clock cycles plus the pipeline depth, but the average iteration takes very close to one clock cycle.

** Well, 64-bit integer registers or parts of one, plus some other registers that it makes sense to treat as just another integer register. Floating-point values have different register names, and their own large renaming register set.