* Re: Portable memory barrier?
2017-05-08 16:22 ` Dmitry A. Kazakov
@ 2017-05-08 18:39 ` Robert Eachus
2017-05-08 19:18 ` Robert Eachus
` (3 subsequent siblings)
4 siblings, 0 replies; 65+ messages in thread
From: Robert Eachus @ 2017-05-08 18:39 UTC (permalink / raw)
On Monday, May 8, 2017 at 12:22:55 PM UTC-4, Dmitry A. Kazakov wrote:
> On 2017-05-08 17:56, Robert Eachus wrote:
> > One sentence summary: The CPU will play lots of tricks behind your
> > back, but in such a way you can pretend your code works as you intended.
> > (Er, after you get the bugs out.)
> >
> > On Monday, May 8, 2017 at 3:45:26 AM UTC-4, Dmitry A. Kazakov wrote:
> >
> >> I don't believe either of the sequences can get reordered by the CPU,
> >> but it is only belief and common sense.
> >
> > Silly rabbit, Trix are for kids. Oops, silly programmer, common
> > sense has nothing to do with modern CPUs.
> [...]
> > Is any of this important to a high-level language programmer? Not
> > really, unless they are working in hard real time. (Which I did.) If you
> > program in assembler, or more likely work on a compiler back-end,
> > knowing which instructions fit together is important. Even more
> > important is coding so that an instruction that references a new cache
> > line in memory is followed by as many instructions as possible that
> > don't use that value. Some times you run into cases where you want to
> > take advantage of write combining or some other trick involving the
> > write pipe--but today, smart compiler optimization is about accesses to
> > main memory and not much else.
>
> I don't see anything there that would prevent the CPU from changing the
> order of accesses to two *independent* memory locations when only the
> program logic, and nothing else, ties them together. Well, except that
> there is no obvious reason why it would want to change that order.
>
> Wikipedia provides a simpler example of the problem:
>
> https://en.wikipedia.org/wiki/Memory_barrier
>
> The case of a lock-free FIFO is different because it involves an index
> and an array element.
>
> For practical use, when the buffer size is > 1, there cannot be any problem.
>
> Yet the question stands. In particular, which Ada primitives may ensure
> safe implementations of lock-free structures when no protected objects
> are used.
>
> --
> Regards,
> Dmitry A. Kazakov
> http://www.dmitry-kazakov.de
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-08 16:22 ` Dmitry A. Kazakov
2017-05-08 18:39 ` Robert Eachus
@ 2017-05-08 19:18 ` Robert Eachus
2017-05-08 21:09 ` Dmitry A. Kazakov
2017-05-08 20:29 ` Niklas Holsti
` (2 subsequent siblings)
4 siblings, 1 reply; 65+ messages in thread
From: Robert Eachus @ 2017-05-08 19:18 UTC (permalink / raw)
On Monday, May 8, 2017 at 12:22:55 PM UTC-4, Dmitry A. Kazakov wrote:
> Yet the question stands. In particular, which Ada primitives may ensure
> safe implementations of lock-free structures when no protected objects used.
There are two rules on x86 and AMD64 CPUs that govern here:
Instructions must be retired in the correct order. (In practice, CPUs retire instructions in bunches.)
The execution state "seen" by an instruction reflects those instructions which were retired before it, and only those instructions.
In other words, the index update will occur, but the value to be used in the data field write will be kept in the original register. This may result in two or more registers named AX, but that is a detail. Only the appropriate version will be visible to any other instruction.
So the CPU can, and normally will, move the counter update before the data write. The CPU wants to merge two writes to the same cache line when possible. Obviously, it won't wait for another write of the counter or buffer, but if they are present in the reorder buffers, the CPU will do what it can to merge them. (The first one goes into the write pipe, but the CPU notices that it will need that value for the associated data write, so it pulls a copy out of the write buffer while leaving the original write in place. If the next update becomes available while the index is still in the write pipe, the CPU will cancel the original write or change the value to be written; which one depends on the particular hardware. Also, the CPU may keep a copy in a renamed register, or it may rely on the copy in the write pipe, depending on the timing and on the size of the register.)
If, and ONLY if, there is an interrupt, and it is due to an instruction with an associated state (usually a cache miss), the CPU will stop in the correct state, no matter how much patching up that requires. (Minimize page faults, but you knew that. ;-) THEN the interrupt will occur. A page fault here is a reference to a memory location which is in the pagefile, not in memory. The CPU may get involved in a cache miss, but that process is usually all in hardware and tries to be invisible to the instruction stream. In other words, the CPU will try to anticipate cache misses, but when it does so, it won't cause an interrupt; it will just throw the prefetch away.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-08 19:18 ` Robert Eachus
@ 2017-05-08 21:09 ` Dmitry A. Kazakov
2017-05-08 23:24 ` Robert Eachus
0 siblings, 1 reply; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-08 21:09 UTC (permalink / raw)
On 2017-05-08 21:18, Robert Eachus wrote:
> So the CPU can, and normally will, move the counter update before
> the data write. The CPU wants to merge two writes to the same cache line
> when possible. Obviously, it won't wait for another write of the counter
> or buffer, but if they are present in the reorder buffers, the CPU will
> do what it can to merge them. (The first one goes into the write pipe,
> but the CPU noticed that it will need that value for the associated data
> write. So it pulls a copy out of the write buffer, while leaving the
> original write in place. If the next update becomes available while the
> index is still in the write pipe, the CPU will cancel the original
> write, or change the value to be written. (Depends on the particular
> hardware. Also the CPUs may keep a copy in a renamed register, or it may
> rely on the copy in the write pipe. Depends on the timing, and on the
> size of the register.)
This strategy does not prevent reordering writes:
FIFO (Write_Index + 1) := Element;
Write_Index := Write_Index + 1;
into
Write_Index := Write_Index + 1;
FIFO (Write_Index) := Element;
which would be a bug.
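To make the required ordering concrete, here is a minimal C++ analogue of the correct version (my own sketch with invented names, not a translation of anyone's posted Ada): the element is written first, and the index is published with a release store, so a consumer that acquire-loads the index is guaranteed to see the element.

```cpp
#include <atomic>
#include <cstddef>

// Single-producer / single-consumer slots; names are invented for the sketch.
constexpr std::size_t Size = 16;
int fifo[Size];
std::atomic<std::size_t> write_index{0};

void put(int element) {
    std::size_t i = write_index.load(std::memory_order_relaxed);
    fifo[(i + 1) % Size] = element;                       // write the data first
    write_index.store(i + 1, std::memory_order_release);  // then publish the index
}

bool get(std::size_t &last_seen, int &out) {
    // The acquire load pairs with the release store above: if we see the
    // new index, we are guaranteed to see the element written before it.
    std::size_t i = write_index.load(std::memory_order_acquire);
    if (i == last_seen)
        return false;                                     // nothing new published
    last_seen += 1;
    out = fifo[last_seen % Size];
    return true;
}
```

Reordering the two stores in put (publishing the index before the slot is filled) is exactly the bug described above: get could then hand back a slot that has not been written yet.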
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-08 21:09 ` Dmitry A. Kazakov
@ 2017-05-08 23:24 ` Robert Eachus
2017-05-09 0:30 ` Jere
2017-05-09 19:57 ` Randy Brukardt
0 siblings, 2 replies; 65+ messages in thread
From: Robert Eachus @ 2017-05-08 23:24 UTC (permalink / raw)
On Monday, May 8, 2017 at 5:09:45 PM UTC-4, Dmitry A. Kazakov wrote:
> This strategy does not prevent reordering writes:
>
> FIFO (Write_Index + 1) := Element;
> Write_Index := Write_Index + 1;
>
> into
>
> Write_Index := Write_Index + 1;
> FIFO (Write_Index) := Element;
>
> which would be a bug.
You are UNDERthinking it. There is a situation which can occur, and that programmers should be wary of. (Skip to just before the footnotes if you don't care about details.)
What happens is that the code is broken into:
r1 := Write_Index;
...
r2 := r1 x Element'Length;
r3 := r2 + FIFO'First;
r4 := Element;
(r3) := r4;
r1 := r1 + 1;
(Write_Index'Address) := r1;
(Parentheses write to the address.)
The compiler will play some games, and will certainly leave Write_Index in a register if possible. But with each of these assignments, there is a context consisting of its set of registers. So the two assignments to r1 go to different (renaming) registers. We don't care what the CPU calls them, as long as it keeps track of which r1 belongs to which context.
Now the CPU can reorder to its heart's content, as long as it doesn't try to use a value before it has been computed.* Old values? No problem, they are available in renamed registers. The value of Write_Index (in r1) when the write to FIFO completes? Who cares, it was used to compute the correct address earlier.
How does the CPU know when a value in a register is no longer needed? When all of the contexts containing that value of the register have been used. Count every use of the value until the register is changed, and, of course, reduce the count on use. The counts are not unbounded, because you only count for current residents of the reorder buffers (ROBs) as micro-ops are parsed and loaded in. What if you need more registers than you have in the renaming list? Stall instruction parsing until one is freed up. That almost never happens. Even though the AMD64 ISA has 16 integer register names** that can appear in a program, CPUs have about a hundred renaming registers per core. Recent architectures, Intel's Broadwell and AMD's Zen, have 168-entry integer register files. (This means 1344 total in an eight-core chip. Note that the renaming registers are shared between the two threads per core. There are only 160 floating-point renaming registers in Zen--1280 total. This is eight per core shy of Intel's Skylake and Kaby Lake.)
So what can happen that is a potential problem? All those renaming registers mean that the most recently written value of Write_Index can trail writes to the FIFO queue. Normally this will only happen when you inline code and there are multiple instances of the write routine inlined into a single (callable) procedure or function. Now you can have several values in the FIFO which are not yet readable. As long as you are not worried about (nanosecond) latency, that is not an issue. Worst case, the same thing happens on the read side, and the writing side needs to wait. Speaking as a former real-time programmer: you don't want the writes getting thrown on the floor.
You can have the same problem if you set the write side at a higher priority than the read side on a single-core, non-multithreading CPU. In either case, the solution is the same: make the FIFO large enough that such collisions (between the two indexes) don't happen. This is the sort of issue that the proof-of-correctness folk don't see. You can (and should) test for this in any hard real-time system, no matter how much theorem proving you do. Assuming you do it right, you now have average and maximum queuing latencies to feed back into your theorem prover. I'm not against theorem proving; I just know from experience that analysis and testing are also needed. ;-)
* Technically use a value before the pipe computing it has it ready for use. Due to the nature of pipelined architectures, it matters not if that value is not ready until stage 10, if the instruction that needs it won't need it until (its) stage 9. The CPU sees that as those two instructions cannot be dispatched on the same clock cycle, but otherwise the "store forwarding" makes it look like it only took one clock cycle to compute. If you--and the CPU--are clever, you can even have the two micro-ops in the same pipe, one after the other. If you are even more clever, you can choose micro-ops such that they can execute on the same clock cycle.
It is not that difficult to write the innermost loops of some programs to take only one clock cycle on average per iteration. You have three or more instructions executing together, as parts of the loop for iterations n, n-1, and n-2. So any single iteration of the loop takes two clock cycles plus the pipeline depth, but the average iteration takes very close to one clock cycle.
** Well, 64-bit integer registers or parts of one, plus some other registers that it makes sense to treat as just another integer register. Floating-point values have different register names, and their own large renaming register set.
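The "make the FIFO large enough" advice implies the producer must be able to detect a full queue. A small hedged sketch (invented names, not from the thread; assumes the indices grow monotonically and are reduced modulo the size only when a slot is accessed):

```cpp
#include <cstddef>

constexpr std::size_t Capacity = 8;   // assumed queue size for the sketch

// With monotonically increasing indices, unsigned subtraction gives the
// occupancy correctly even after the counters wrap around.
bool is_full(std::size_t write_index, std::size_t read_index) {
    return write_index - read_index >= Capacity;
}

bool is_empty(std::size_t write_index, std::size_t read_index) {
    return write_index == read_index;
}
```

When is_full reports true, the writer must wait, which is exactly the collision between the two indexes that the post says sizing and testing should make (measurably) rare.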
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-08 23:24 ` Robert Eachus
@ 2017-05-09 0:30 ` Jere
2017-05-09 4:02 ` Robert Eachus
` (2 more replies)
2017-05-09 19:57 ` Randy Brukardt
1 sibling, 3 replies; 65+ messages in thread
From: Jere @ 2017-05-09 0:30 UTC (permalink / raw)
On Monday, May 8, 2017 at 7:24:42 PM UTC-4, Robert Eachus wrote:
> On Monday, May 8, 2017 at 5:09:45 PM UTC-4, Dmitry A. Kazakov wrote:
>
> > This strategy does not prevent reordering writes:
> >
> > FIFO (Write_Index + 1) := Element;
> > Write_Index := Write_Index + 1;
> >
> > into
> >
> > Write_Index := Write_Index + 1;
> > FIFO (Write_Index) := Element;
> >
> > which would be a bug.
>
> You are UNDERthinking it. There is a situation which can occur, and programmers should be wary of. (Skip to before the footnotes if you don't care about details.)
>
> What happens is that the code is broken into:
>
> r1 := Write_Index;
> ...
> r2 := r1 x Element'Length;
> r3 := r2 + FIFO'First;
> r4 := Element;
> (r3) := r4;
> r1 := r1 + 1;
> (Write_Index'Address) := r1;
>
> (Parentheses write to the address.)
>
> The compiler will play some games, and certainly will leave Write_Index in a register if possible. But with each of these assignments, there is a context consisting of its set of registers. So the two assignments to r1 go to different (renaming) registers. We dpn't care what the CPU calls them, as long as it keeps track of which r1 belongs to which contexts.
>
> Now the CPU can reorder to its heart's content, as long as it doesn't try to use a value before it has been computed.* Old values? No problem, they are available in renamed registers. The value of Write_Index (in r1) when the write to FIFO completes? Who cares, it was used to compute the correct address earlier.
>
> How does the CPU know when a value in a register is no longer needed? When all of the contexts containing that value of the register have been used. Count every use of the value until the register is changed, and, of course, reduce the count on use. The counts are not unbounded because you only count for current residents of the reordering buffers (ROBs) as micro-ops are parsed and loaded in. What if you need more registers than you have in the renaming list? Stall instruction parsing until one is freed up. Almost never happens. Even though the AMD64 ISA has 16 integer register names** that can appear in a program, CPUs have about a hundred renaming registers per core. Recent architectures, Intel's Broadwell and AMD's Zen have 168 entry integer register files. (This means 1344 total in an eight core chip. Note that the renaming registers are shared between two threads per core. There are only 160 floating-point renaming registers in Zen--1280 total. This is eight per core shy of Intel's Skylake and Kaby Lake.)
>
>
> So what can happen that is a potential problem? All those renaming registers, mean that the most recent written value of Write_Index can trail writes to the FIFO queue. Normally this will only happen when you inline code, and there are multiple instances of the write routine inlined into a single (callable) procedure or function. Now you can have several values in the FIFO which will not yet be readable. As long as you are not worried about (nanosecond) latency, not an issue. Worst case, the same thing happens on the read side, and the writing side needs to wait. Being a former real-time programmer, you don't want the writes getting thrown on the floor.
>
> You can have the same problem if you set the write side at a higher priority than the read side on a single core no multithreading CPU. In either case, the solution is the same: Make the FIFO large enough that such collisions (between the two indexes) don't happen. This is the sort of issue that the proof of correctness folk don't see. You can (and should) test for this in any hard real-time system, no matter how much theorem proving you do. Assuming you do it right, now you have average and maximum queuing latencies to feed back into your theorem prover. I'm not against theorem proving, I just know from experience that analysis and testing is also needed. ;-)
>
> * Technically use a value before the pipe computing it has it ready for use. Due to the nature of pipelined architectures, it matters not if that value is not ready until stage 10, if the instruction that needs it won't need it until (its) stage 9. The CPU sees that as those two instructions cannot be dispatched on the same clock cycle, but otherwise the "store forwarding" makes it look like it only took one clock cycle to compute. If you--and the CPU--are clever, you can even have the two micro-ops in the same pipe, one after the other. If you are even more clever, you can choose micro-ops such that they can execute on the same clock cycle.
>
> It is not that difficult to write the innermost loops of some programs to take only one clock cycle on average per iteration. You have three or more instructions executing together, as parts of the loop for iterations n, n-1, and n-2. So any iteration of the loop takes two clock cycles plus the pipeline depth, but the average iteration take very close to one clock cycle.
>
> ** Well 64-bit integer registers or parts of one, plus some other registers that it makes sense to treat as just another integer register. Floating point values have different register names, and their own large renaming register set.
I am still digesting some of this, so forgive the simple question:
What prevents the last 3 lines from being reordered like this:
r1 := r1 + 1;
(Write_Index'Address) := r1;
(r3) := r4;
I may have misunderstood your post, but isn't it free to make this adjustment
to the instructions? There is no dependence on r3 or r4 for the
Write_Index'Address part or the r1 increment. I ask because this very issue
has (allegedly) been seen when implementing lock-free designs in other
languages like C/C++.
If it is, then the risk is that after
(Write_Index'Address) := r1;
a task context switch or interrupt occurs, and now the index is incremented
before the data was added (and is now pointing to junk).
By the way, thank you for the very detailed explanation. I can't say
that I fully understand it all yet (as probably indicated from my question).
I'm probably gonna have to reread it a few times.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-09 0:30 ` Jere
@ 2017-05-09 4:02 ` Robert Eachus
2017-05-09 4:32 ` Robert Eachus
2017-05-09 20:01 ` Randy Brukardt
2 siblings, 0 replies; 65+ messages in thread
From: Robert Eachus @ 2017-05-09 4:02 UTC (permalink / raw)
On Monday, May 8, 2017 at 8:30:02 PM UTC-4, Jere wrote:
First, oops! I missed that the program computes Write_Index + 1 twice. The CPU won't, but the code sequence should be:
r1 := Write_Index;
...
r1 := r1 + 1;
r2 := r1 x Element'Length;
r3 := r2 + FIFO'First;
r4 := Element;
(r3) := r4;
(Write_Index'Address) := r1;
(Parentheses write to the address.)
> What prevents the last 3 lines from being reordered like this:
> r1 := r1 + 1;
> (Write_Index'Address) := r1;
> (r3) := r4;
The rule against reordering of writes. I think your question still stands, but that is the answer: moves to registers are reads, moves to memory are writes. Reads can be moved before writes to other locations, but the order of writes has to be maintained.
> I may have misunderstood your post, but isn't it free to make this adjustment
> to the instructions? There is no dependence on r3 or r4 for the
> Write_Index'Address part or the r1 increment. I ask because this very issue
> has (allegedly) been seen implementing lock free designs in other languages
> like C/C++.
>
> If it is, then the risk is that after
> (Write_Index'Address) := r1;
>
> a task context switch or interrupt occurs and now the index is incremented
> before the data was added (and now pointing to junk).
No. Or at least it should not happen. A single register set exists when the interrupt is allowed to occur. This may have the (correct) value of Write_Index
in memory (or not) or in r1 (or not). If r1 has the old or new value, that is not a problem. If the write to Write_Index occurred, the write to the FIFO must have happened first (on an x86 or AMD64 CPU, at least). In other words, the write pipe will get emptied before the cache gets flushed. Note that the cache flush does not occur until after the interrupt.* On other (usually RISC) architectures, the interrupt service routine may have to execute a "fence" instruction to force the CPU to a consistent state. (x86 does have three fence instructions; I found an interesting writeup that shows they are very occasionally needed: https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/) Fences are more often needed on RISC CPUs, since their ordering rules are more relaxed. That may be where the problem you noted occurred. Especially in embedded controllers, there can be interrupt service routines not supplied by the OS. And especially if the other half of this stream is implemented in the interrupt handler, the fence instruction is necessary.
In general, do not use fences in Ada code; use protected operations. At a minimum, the lock needed for a cleverly written protected object can be replaced by a read-modify-write (RMW) instruction, and the CPU should execute it in under a nanosecond. Code cannot be moved past RMW operations, and they are much faster than fences.
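As a C++ analogue (not Ada, and only an illustration of the advice above), the "replace the lock with a read-modify-write" idea corresponds to an atomic RMW such as fetch_add; on x86 this compiles to a LOCK-prefixed instruction, which also orders the surrounding memory accesses, so no separate fence is needed. Names here are invented:

```cpp
#include <atomic>

// A lock-free ticket dispenser: each caller gets a unique number.
std::atomic<unsigned> next_ticket{0};

unsigned take_ticket() {
    // One indivisible read-modify-write; no lock and no explicit fence.
    return next_ticket.fetch_add(1, std::memory_order_seq_cst);
}
```

Two tasks calling take_ticket concurrently can never receive the same number, which is the property a protected object would otherwise have to supply with a lock.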
Why isn't a fence needed here? When it comes to shared variables, one thread only writes. But the example is flawed, in that we do have an access that needs to be protected. To make the buffer a circular buffer, you need to add a test and adjustment. For simplicity, change:
FIFO (Write_Index + 1) := Element;
Write_Index := Write_Index + 1;
to:
Temp := Write_Index;
if Temp = Max then
   Temp := 1;
else
   Temp := Temp + 1;
end if;
FIFO (Temp) := Element;
Write_Index := Temp;
Making Temp explicit is a belt-and-suspenders operation. The CPU will generate the same set of micro-ops, or at least the same set as it will once we make the buffer circular. But now it should be clear that the two writes cannot be reordered. Note that the writes to Temp can be replaced by writes to r1, and those count as reads under the x86 rules.
...and the same for Read_Index.
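For comparison, the same "compute into a temporary, write the slot, then publish" sequence can be sketched in C++ (my own analogue with an assumed size, not anyone's posted code; the release store stands in for the x86 write-ordering rule the post relies on):

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t Max = 4;              // assumed buffer size for the sketch
int fifo[Max + 1];                          // slots 1 .. Max, as in the Ada
std::atomic<std::size_t> write_index{Max};  // so the first slot used is 1

void put(int element) {
    std::size_t temp = write_index.load(std::memory_order_relaxed);
    temp = (temp == Max) ? 1 : temp + 1;    // the explicit wrap-around test
    fifo[temp] = element;                   // write the slot first
    write_index.store(temp, std::memory_order_release);  // then publish
}
```

The two writes appear in this order in the source, and the release store keeps both the compiler and the CPU from publishing the index before the slot is filled.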
Simpler? Huh? What are we doing here? The same thing we did in the pseudocode, but with the wrap-around test made explicit.
> I'm probably gonna have to reread it a few times.
I could probably teach a graduate course on this without including anything not touched on here. And yes, I did teach a graduate course in compiler construction several times, back before this revolution in how CPUs are put together. If I were teaching that course now, I would either limit it to the front end and a bit of optimization, or stretch it into two semesters. I won't say that this stuff makes quantum mechanics look easy, but the two are at the same level of complexity. (In fact, you may have seen that Elon Musk, among others, believes there is a good chance we are living in a simulation. The similarities are eerie, and the "as ifs" pop up in the same places.)
* Especially if you are running on an AMD CPU in a virtual processor. As long as the CPU stays in the hypervisor state, the cache doesn't need to be flushed. That's a detail, and a huge bite of extra complexity here. It allows the hypervisor to service a page fault without flushing the cache. Of course, if there is a context switch, even to the OS on the virtual machine, the cache flush is needed. Usually, if the page is not present in memory, the cache flush (and a context switch) happens, then the hypervisor turns control over to the "real" OS. (Disks, even SSDs, are slow.) Otherwise the hypervisor patches up the page tables and lets the (faulting) instruction be retried, never touching the OS.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-09 0:30 ` Jere
2017-05-09 4:02 ` Robert Eachus
@ 2017-05-09 4:32 ` Robert Eachus
2017-05-09 4:44 ` Robert Eachus
2017-05-09 22:26 ` Jere
2017-05-09 20:01 ` Randy Brukardt
2 siblings, 2 replies; 65+ messages in thread
From: Robert Eachus @ 2017-05-09 4:32 UTC (permalink / raw)
Argh! I just accidentally deleted a long post, apparently due to a bug in the new version of Windows 10. Anyway:
On Monday, May 8, 2017 at 8:30:02 PM UTC-4, Jere wrote:
> On Monday, May 8, 2017 at 7:24:42 PM UTC-4, Robert Eachus wrote:
> > On Monday, May 8, 2017 at 5:09:45 PM UTC-4, Dmitry A. Kazakov wrote:
> >
> > > This strategy does not prevent reordering writes:
> > >
> > > FIFO (Write_Index + 1) := Element;
> > > Write_Index := Write_Index + 1;
> > >
> > > into
> > >
> > > Write_Index := Write_Index + 1;
> > > FIFO (Write_Index) := Element;
> > >
> > > which would be a bug.
That requires reordering writes, which is not allowed on x86 and AMD64. There are fence instructions on RISC machines that may be needed. There are three fence instructions on x86; they are rarely used, and should be non-existent in user code. Here is an instance where you do need a fence on x86, or can instead change a couple of instructions to RMW (atomic) operations: https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
> > What happens is that the code is broken into:
> >
> > r1 := Write_Index;
> > ...
> > r2 := r1 x Element'Length;
> > r3 := r2 + FIFO'First;
> > r4 := Element;
> > (r3) := r4;
> > r1 := r1 + 1;
> > (Write_Index'Address) := r1;
> >
> > (Parentheses write to the address.)
Oops! I misread the example. The right code is:
r1 := Write_Index;
...
r1 := r1 + 1; -- added
r2 := r1 x Element'Length;
r3 := r2 + FIFO'First;
r4 := Element;
(r3) := r4;
-- deleted from here.
(Write_Index'Address) := r1;
The important point is that the two writes to memory cannot be reordered. They will be put into the write pipe in that order.
> I am still digesting some of this, so forgive the simple question:
>
> What prevents the last 3 lines from being reordered like this:
> r1 := r1 + 1;
> (Write_Index'Address) := r1;
> (r3) := r4;
See above. Writes cannot be moved past writes. In practice the write pipe does some magic, combining writes to the same cache line, etc., but it maintains the order of writes to cache. If you have the right equipment, you can peek and see that the writes to main memory are misordered, or even missing. But without a signal analyzer or some such, you can't peek at main memory while bypassing the cache.
> By the way, thank you for the very detailed explanation. I can't say
> that I fully understand it all yet (as probably indicated from my question).
> I'm probably gonna have to reread it a few times.
The material here fleshed out is probably a full semester course. I taught compiler construction as a graduate course in the past, and I would now insist on two semesters, or break it into front-end and code generation courses.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-09 4:32 ` Robert Eachus
@ 2017-05-09 4:44 ` Robert Eachus
2017-05-09 22:26 ` Jere
1 sibling, 0 replies; 65+ messages in thread
From: Robert Eachus @ 2017-05-09 4:44 UTC (permalink / raw)
On Tuesday, May 9, 2017 at 12:32:38 AM UTC-4, Robert Eachus wrote:
Did it again; well, I posted before I was finished editing. But this time I'm just going to add a few sentences:
When reading the x86 rules, understand moves to registers as reads and moves to memory as writes, even though they use the same op code. Oh, and notice that the strict ordering on writes is on moves to cache, not on when they get copied back to main memory. The same goes for reads. Reads and writes of different locations can be reordered; reads cannot be moved past reads, and writes cannot be moved past writes.
The rule on reordering of reads has a big gotcha. The CPU can speculatively read from a memory location whenever it wants, but it must not DO anything with that data until a real read comes along. This allows the cache to pull in a complete cache line, among other things. And the most important part of not doing anything is: no interrupts.
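One caveat worth making explicit: the x86 rules above constrain only the hardware; a compiler may still reorder plain stores. In C++ terms (an analogue with invented names, not anyone's posted code), the data-then-flag pattern therefore still needs release/acquire at the language level, even though the CPU part is free on x86:

```cpp
#include <atomic>

int data = 0;
std::atomic<bool> ready{false};

void publisher() {
    data = 42;
    // The release store forbids BOTH the compiler and the CPU from hoisting
    // the flag store above the data store. On x86 the CPU half costs nothing
    // (stores are not reordered with stores), but the compiler half matters.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    if (ready.load(std::memory_order_acquire))
        return data;   // guaranteed to see 42 once the flag is visible
    return -1;
}
```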
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-09 4:32 ` Robert Eachus
2017-05-09 4:44 ` Robert Eachus
@ 2017-05-09 22:26 ` Jere
1 sibling, 0 replies; 65+ messages in thread
From: Jere @ 2017-05-09 22:26 UTC (permalink / raw)
On Tuesday, May 9, 2017 at 12:32:38 AM UTC-4, Robert Eachus wrote:
> > I am still digesting some of this, so forgive the simple question:
> >
> > What prevents the last 3 lines from being reordered like this:
> > r1 := r1 + 1;
> > (Write_Index'Address) := r1;
> > (r3) := r4;
>
> See above. Writes cannot be moved past writes. In practice the write pipe does some magic combining writes to the same cache line, etc. But it maintains the order on writes to cache. If you have the right equipment you can peek and see that the writes to main memory are misordered, or even missing. But without a signal analyzer or some such, you can't peek at main memory bypassing the cache.
Again, thanks for the responses.
I wanted to get back to this. I wanted to do some testing, so I generated an
example of two tasks that each update an index and then copy another (2 stores
and a load). I ran it on the two different Intel processors I had to test
with: an i7-4702MQ and an i7-6700. In both cases, the CPU reordered the stores
with the subsequent loads (the generated assembly had them in the correct
order). On the first processor I was running GCC 4.1.2 and pragma Atomic was
not working as Randy described, so I had to insert a makeshift fence (I
declared a protected object), which prevented the reordering. On the second
processor I was running GCC 4.9.3, which properly inserted synchronization
primitives to prevent CPU instruction reordering, but if I enabled the GNAT
pragma to turn off those primitives, it behaved like the GCC 4.1.2 version.
I'll put the test code below. I hope it illustrates the issue I was trying
to describe earlier.
Test_Reorder.ads
********************************************************************************

package Test_Reorder is
   procedure Run;
end Test_Reorder;

Test_Reorder.adb
********************************************************************************

pragma Ada_95;

with Ada.Numerics.Discrete_Random;
with Ada.Text_IO;

package body Test_Reorder is

   X : Integer := 0;
   Y : Integer := 0;
   X_Copy : Integer := 0;
   Y_Copy : Integer := 0;

   -- In GCC 4.9.3, these pragmas are sufficient
   -- to prevent reordering.  Not in GCC 4.1.2, though.
   pragma Atomic (X);
   pragma Atomic (Y);
   pragma Atomic (X_Copy);
   pragma Atomic (Y_Copy);

   -- Uncomment these to get CPU instruction reordering in 4.9.3
   -- pragma Disable_Atomic_Synchronization (X);
   -- pragma Disable_Atomic_Synchronization (Y);
   -- pragma Disable_Atomic_Synchronization (X_Copy);
   -- pragma Disable_Atomic_Synchronization (Y_Copy);

   package Positive_Random is new Ada.Numerics.Discrete_Random (Positive);

   Generator : Positive_Random.Generator;

   function Random_Number return Natural is
   begin
      return Positive_Random.Random (Generator) mod 8;
   end Random_Number;

   task Task1 is
      entry Start;
      entry Stop;
   end Task1;

   task Task2 is
      entry Start;
      entry Stop;
   end Task2;

   task body Task1 is
   begin
      loop
         accept Start;
         -- Delay a random amount of time
         while Random_Number /= 0 loop
            null;
         end loop;
         X := 1;        -- In GCC 4.1.2 I need to insert a fence after this
         Y_Copy := Y;
         accept Stop;
      end loop;
   end Task1;

   task body Task2 is
   begin
      loop
         accept Start;
         -- Delay a random amount of time
         while Random_Number /= 0 loop
            null;
         end loop;
         Y := 1;        -- In GCC 4.1.2 I need to insert a fence after this
         X_Copy := X;
         accept Stop;
      end loop;
   end Task2;

   procedure Run is
      Detections : Natural := 0;
      Iterations : Natural := 0;
   begin
      Positive_Random.Reset (Generator);
      loop
         X := 0;
         Y := 0;
         Task1.Start;
         Task2.Start;
         Task1.Stop;
         Task2.Stop;
         Iterations := Iterations + 1;
         -- If both are 0, then at least one got reordered
         if X_Copy = 0 and Y_Copy = 0 then
            Detections := Detections + 1;
            Ada.Text_IO.Put_Line
              (Integer'Image (Detections)
               & " reorders found in "
               & Integer'Image (Iterations)
               & " iterations");
         end if;
      end loop;
   end Run;

end Test_Reorder;
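For comparison, here is a C++ rendering of the same store-then-load litmus test (my own sketch, not code from the thread). With seq_cst operations there is a single total order over the four memory operations, so the "both copies are zero" outcome is forbidden; weakening them to memory_order_relaxed allows the reordering the Ada test detects to reappear.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int x_copy = 0, y_copy = 0;

// One round of the litmus test: each thread stores to one variable and
// then loads the other. Each plain copy is written by exactly one thread
// and read only after both joins, so there is no data race on the copies.
void run_once() {
    x.store(0);
    y.store(0);
    std::thread t1([] {
        x.store(1, std::memory_order_seq_cst);
        y_copy = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([] {
        y.store(1, std::memory_order_seq_cst);
        x_copy = x.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
}
```

Under seq_cst, at least one of the two loads must come after the other thread's store in the total order, so at least one copy is always 1, no matter how many rounds are run.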
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-09 0:30 ` Jere
2017-05-09 4:02 ` Robert Eachus
2017-05-09 4:32 ` Robert Eachus
@ 2017-05-09 20:01 ` Randy Brukardt
2 siblings, 0 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-09 20:01 UTC (permalink / raw)
"Jere" <jhb.chat@gmail.com> wrote in message
news:8a968aae-79e4-421b-ba4b-e0a9a33ce0db@googlegroups.com...
...
>I am still digesting some of this, so forgive the simple question:
Don't bother, it's irrelevant if you trust your compiler vendor. If you
don't, then you're out of luck on modern CPUs because mere mortals cannot
understand them.
>What prevents the last 3 lines from being reordered like this:
>r1 := r1 + 1;
>(Write_Index'Address) := r1;
>(r3) := r4;
Assuming again that Write_Index is atomic as is necessary to make any
guarantees, this translation is 100% wrong (can't put Write_Index into a
register), so reordering is irrelevant. If the CPU is going to do it anyway,
then the compiler has to do whatever the CPU vendor provided to prevent it.
If Write_Index isn't atomic, then you can make no assumptions about how the
code is executed and hopefully you aren't (it is purely sequential code).
Randy.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-08 23:24 ` Robert Eachus
2017-05-09 0:30 ` Jere
@ 2017-05-09 19:57 ` Randy Brukardt
2017-05-10 0:51 ` Jere
2017-05-10 16:19 ` Robert Eachus
1 sibling, 2 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-09 19:57 UTC (permalink / raw)
"Robert Eachus" <rieachus@comcast.net> wrote in message
news:077e7f6a-5a7b-4b88-a16f-7672aec18a17@googlegroups.com...
...
>The compiler will play some games, and certainly will leave Write_Index in
>a register if possible.
It better not if Write_Index is Atomic.
Which is my point on this discussion: what the CPU does is irrelevant if the
compiler is implemented correctly. If it isn't, then the compiler is wrong.
QED.
Only the compiler vendor need worry about this level of discussion.
Honestly, I doubt anyone else is qualified. (I'm not, and I AM a compiler
vendor... ;-)
Randy.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-09 19:57 ` Randy Brukardt
@ 2017-05-10 0:51 ` Jere
2017-05-10 5:25 ` J-P. Rosen
` (2 more replies)
2017-05-10 16:19 ` Robert Eachus
1 sibling, 3 replies; 65+ messages in thread
From: Jere @ 2017-05-10 0:51 UTC (permalink / raw)
On Tuesday, May 9, 2017 at 3:57:50 PM UTC-4, Randy Brukardt wrote:
> "Robert Eachus" wrote in message
> ...
> >The compiler will play some games, and certainly will leave Write_Index in
> >a register if possible.
>
> It better not if Write_Index is Atomic.
>
> Which is my point on this discussion: what the CPU does is irrelevant if the
> compiler is implemented correctly. If it isn't, then the compiler is wrong.
> QED.
>
> Only the compiler vendor need worry about this level of discussion.
> Honestly, I doubt anyone else is qualified. (I'm not, and I AM a compiler
> vendor... ;-)
>
> Randy.
Is there a method besides Atomic? If I am implementing a generic FIFO (lock
free) and the FIFO elements are complex data types that may not be able to be
atomic, do I have any other options or are protected objects my only way out?
As an example, what if my FIFO were an array of records containing a 64-bit
integer? I know I can make the FIFO volatile, but as you indicated, that isn't
enough. I can't make it Atomic either.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-10 0:51 ` Jere
@ 2017-05-10 5:25 ` J-P. Rosen
2017-05-10 22:56 ` Jere
2017-05-10 7:13 ` Dmitry A. Kazakov
2017-05-10 16:38 ` Jeffrey R. Carter
2 siblings, 1 reply; 65+ messages in thread
From: J-P. Rosen @ 2017-05-10 5:25 UTC (permalink / raw)
Le 10/05/2017 à 02:51, Jere a écrit :
> Is there a method besides Atomic? If I am implementing a generic FIFO (lock
> free) and the FIFO elements are complex data types that may not be able to be
> atomic, do I have any other options or are protected objects my only way out?
Protected objects, or for a lower level, Synchronous_Task_Control. I
know that you want lock free algorithms, but out of curiosity, did you
measure the cost of locking algorithms? IOW, is all of this worth the
trouble?
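For concreteness, a minimal sketch of the Synchronous_Task_Control route (the names and the Integer payload are illustrative assumptions, not from this thread):

```ada
with Ada.Synchronous_Task_Control; use Ada.Synchronous_Task_Control;
with Ada.Text_IO;

procedure Demo_STC is
   Data_Ready : Suspension_Object;      --  initially False
   Shared     : Integer := 0;
   pragma Volatile (Shared);

   task Consumer;

   task body Consumer is
   begin
      Suspend_Until_True (Data_Ready);  --  block until the producer signals
      --  Safe to read Shared now: the suspension object is a
      --  synchronization point between the two tasks.
      Ada.Text_IO.Put_Line (Integer'Image (Shared));
   end Consumer;
begin
   Shared := 42;                        --  publish the data ...
   Set_True (Data_Ready);               --  ... then release the consumer
end Demo_STC;
```

Set_True/Suspend_Until_True give you the signaling half of a lock without a full protected object, which is why it counts as "lower level".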
--
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-10 5:25 ` J-P. Rosen
@ 2017-05-10 22:56 ` Jere
2017-05-11 7:36 ` Dmitry A. Kazakov
0 siblings, 1 reply; 65+ messages in thread
From: Jere @ 2017-05-10 22:56 UTC (permalink / raw)
On Wednesday, May 10, 2017 at 1:26:00 AM UTC-4, J-P. Rosen wrote:
> Le 10/05/2017 à 02:51, Jere a écrit :
> > Is there a method besides Atomic? If I am implementing a generic FIFO (lock
> > free) and the FIFO elements are complex data types that may not be able to be
> > atomic, do I have any other options or are protected objects my only way out?
>
> Protected objects, or for a lower level, Synchronous_Task_Control. I
> know that you want lock free algorithms, but out of curiosity, did you
> measure the cost of locking algorithms? IOW, is all of this worth the
> trouble?
I wasn't necessarily aiming for a specific speed/cost. I know I can easily
find a standard task safe container and use that and it might be faster (or at
least fast enough) on various platforms. This is more for a different option
from the normal. It might be faster on some platforms or slower, but it is
mostly meant to be a tool in the toolkit.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-10 22:56 ` Jere
@ 2017-05-11 7:36 ` Dmitry A. Kazakov
2017-05-13 20:25 ` Jere
0 siblings, 1 reply; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-11 7:36 UTC (permalink / raw)
On 11/05/2017 00:56, Jere wrote:
> On Wednesday, May 10, 2017 at 1:26:00 AM UTC-4, J-P. Rosen wrote:
>> Le 10/05/2017 à 02:51, Jere a écrit :
>>> Is there a method besides Atomic? If I am implementing a generic FIFO (lock
>>> free) and the FIFO elements are complex data types that may not be able to be
>>> atomic, do I have any other options or are protected objects my only way out?
>>
>> Protected objects, or for a lower level, Synchronous_Task_Control. I
>> know that you want lock free algorithms, but out of curiosity, did you
>> measure the cost of locking algorithms? IOW, is all of this worth the
>> trouble?
> I wasn't necessarily aiming for a specific speed/cost. I know I can easily
> find a standard task safe container and use that and it might be faster (or at
> least fast enough) on various platforms. This is more for a different option
> from the normal. It might be faster on some platforms or slower, but it is
> mostly meant to be a tool in the toolkit.
One issue is the granularity of locking. If a container operation is not
instantaneous, which depends on the nature of the elements, doing it in
the context of a protected action is a bad idea. The solution with
protected objects is to use a mutex to guard the operation outside the
protected action. The cost is two protected actions per operation
instead of one.
A lock-free implementation does not have this issue. Note that it does
not matter whether the FIFO index is atomic or inside a protected
object; in both cases the method is lock-free.
So, Jere, when you become ARG (:-)), please push for a requirement to
support Atomic for all scalar objects. If the machine lacks the
corresponding instructions, the compiler must fall back to a protected
object. And conversely, if there is a strong case that a protected
action might be more efficient than the machine's atomic access
instructions, the compiler should use the protected action.
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-11 7:36 ` Dmitry A. Kazakov
@ 2017-05-13 20:25 ` Jere
0 siblings, 0 replies; 65+ messages in thread
From: Jere @ 2017-05-13 20:25 UTC (permalink / raw)
On Thursday, May 11, 2017 at 3:36:09 AM UTC-4, Dmitry A. Kazakov wrote:
> On 11/05/2017 00:56, Jere wrote:
> > On Wednesday, May 10, 2017 at 1:26:00 AM UTC-4, J-P. Rosen wrote:
> >> Le 10/05/2017 à 02:51, Jere a écrit :
> >>> Is there a method besides Atomic? If I am implementing a generic FIFO (lock
> >>> free) and the FIFO elements are complex data types that may not be able to be
> >>> atomic, do I have any other options or are protected objects my only way out?
> >>
> >> Protected objects, or for a lower level, Synchronous_Task_Control. I
> >> know that you want lock free algorithms, but out of curiosity, did you
> >> measure the cost of locking algorithms? IOW, is all of this worth the
> >> trouble?
>
> > I wasn't necessarily aiming for a specific speed/cost. I know I can easily
> > find a standard task safe container and use that and it might be faster (or at
> > least fast enough) on various platforms. This is more for a different option
> > from the normal. It might be faster on some platforms or slower, but it is
> > mostly meant to be a tool in the toolkit.
>
> <SNIPPED>
>
> So, Jere, when you become ARG (:-)), please, push for requirement to
> support Atomic for all scalar objects. If the machine lacks
> corresponding instructions the compiler must fall back to a protected
> object. And conversely, if there is a strong case that using a protected
> action might be more efficient than machine's atomic access
> instructions, the compiler should use the former.
>
Well, I believe Randy was offering it up to Shark8, but it is definitely the
type of thing I enjoy. I'm very much into semantics when it comes to this type
of stuff, though obviously I am nowhere near good enough in the Ada standard.
Nor am I experienced enough yet with big systems. My experience is mainly in
bare-metal applications, where Ada is not very predominant (though not absent).
I am much more comfortable in areas where there are no OSes and the CPUs are
fairly simple microcontrollers (8- to 32-bit, single core).
I read through a lot of the posted ARG discussions/minutes, and I did find them
pretty interesting. I found myself disagreeing with some people and agreeing
with others. That said, I am very inexperienced and backseat ARG'ing is easy to
do when I am in my position. It's a difficult job I am sure!
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-10 0:51 ` Jere
2017-05-10 5:25 ` J-P. Rosen
@ 2017-05-10 7:13 ` Dmitry A. Kazakov
2017-05-10 16:45 ` Robert Eachus
` (3 more replies)
2017-05-10 16:38 ` Jeffrey R. Carter
2 siblings, 4 replies; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-10 7:13 UTC (permalink / raw)
On 10/05/2017 02:51, Jere wrote:
> Is there a method besides Atomic? If I am implementing a generic FIFO (lock
> free) and the FIFO elements are complex data types that may not be able to be
> atomic, do I have any other options or are protected objects my only way out?
But you don't need the elements to be atomic, only the index. If I
understood Randy's explanation correctly, atomic access to the buffer
index gives you the barrier between index and element accesses, which is
all that is required.
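A sketch of what that looks like for a single-producer/single-consumer ring buffer (a hypothetical illustration; the names, the Integer element type, and the size are assumptions):

```ada
package Ring is
   procedure Put (Item : in Integer; Done : out Boolean);   --  producer only
   procedure Get (Item : out Integer; Done : out Boolean);  --  consumer only
end Ring;

package body Ring is

   Size : constant := 16;
   type Index is mod Size;

   Buffer : array (Index) of Integer;
   pragma Volatile_Components (Buffer);  --  elements need only be volatile

   Write_Index : Index := 0;             --  written only by the producer
   pragma Atomic (Write_Index);
   Read_Index : Index := 0;              --  written only by the consumer
   pragma Atomic (Read_Index);

   procedure Put (Item : in Integer; Done : out Boolean) is
      W : constant Index := Write_Index;
   begin
      if W + 1 = Read_Index then
         Done := False;                  --  full (one slot kept free)
      else
         Buffer (W) := Item;             --  volatile element write first ...
         Write_Index := W + 1;           --  ... then the atomic index update
         Done := True;                   --      publishes the element
      end if;
   end Put;

   procedure Get (Item : out Integer; Done : out Boolean) is
      R : constant Index := Read_Index;
   begin
      if R = Write_Index then
         Done := False;                  --  empty
      else
         Item := Buffer (R);             --  volatile element read ...
         Read_Index := R + 1;            --  ... then the atomic index update
         Done := True;                   --      releases the slot
      end if;
   end Get;

end Ring;
```

The point being discussed is the two-line sequence in Put: the atomic write to Write_Index may not be reordered (by compiler or CPU) ahead of the volatile write to Buffer, so the consumer never sees the new index before the element is in place.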
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-10 7:13 ` Dmitry A. Kazakov
@ 2017-05-10 16:45 ` Robert Eachus
2017-05-10 17:28 ` Simon Wright
2017-05-10 23:21 ` Jere
` (2 subsequent siblings)
3 siblings, 1 reply; 65+ messages in thread
From: Robert Eachus @ 2017-05-10 16:45 UTC (permalink / raw)
On Wednesday, May 10, 2017 at 3:13:27 AM UTC-4, Dmitry A. Kazakov wrote:
> But you don't need the elements to be atomic, only the index. If I
> understood Randy's explanation correctly, atomic access to the buffer
> index gives you the barrier between index and element accesses, which
> is all that is required.
That is my reading as well, and should result in efficient implementations. Also, for portable code, there are times when Volatile is required. But you really need ancient (in Internet years) hardware for Volatile to have any effect at all. (At least in non-erroneous programs.) In general, Atomic should be implemented as an RMW (read-modify-write) instruction on x86, and will add fences on any hardware that requires them.
The other part, which occurs in this code, is that (user) Atomic variables should only be written by one task/thread. Hmm. Compiler back-ends should probably insert a RMW of a non-program variable to cause synchronization where an Atomic variable is only being read. Of course, this would only eliminate most erroneous executions. Best is for programmers to use protected objects which can correctly (in an interrupt protected way) use RMWs for synchronization.
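For reference, the protected-object alternative mentioned above is short (a sketch; the compiler is free to implement the protected action with a hardware RMW or a lock, whichever suits the target):

```ada
protected Counter is
   procedure Increment (Result : out Natural);
   function Value return Natural;
private
   Count : Natural := 0;
end Counter;

protected body Counter is

   procedure Increment (Result : out Natural) is
   begin
      --  The whole protected action is the read-modify-write; on targets
      --  with suitable instructions the compiler may implement it as a
      --  hardware RMW rather than taking a lock.
      Count := Count + 1;
      Result := Count;
   end Increment;

   function Value return Natural is
   begin
      return Count;
   end Value;

end Counter;
```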
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-10 7:13 ` Dmitry A. Kazakov
2017-05-10 16:45 ` Robert Eachus
@ 2017-05-10 23:21 ` Jere
2017-05-11 0:47 ` Randy Brukardt
2017-05-10 23:30 ` Jere
2017-05-11 0:38 ` Randy Brukardt
3 siblings, 1 reply; 65+ messages in thread
From: Jere @ 2017-05-10 23:21 UTC (permalink / raw)
On Wednesday, May 10, 2017 at 3:13:27 AM UTC-4, Dmitry A. Kazakov wrote:
> On 10/05/2017 02:51, Jere wrote:
>
> > Is there a method besides Atomic? If I am implementing a generic FIFO (lock
> > free) and the FIFO elements are complex data types that may not be able to be
> > atomic, do I have any other options or are protected objects my only way out?
>
> But you don't need the elements to be atomic, only the index. If I
> understood Randy's explanation correctly, atomic access to the buffer
> index gives you the barrier between index and element accesses, which
> is all that is required.
I may have misunderstood him, but from Randy's response I gathered that
Volatile has no effect on sequence, only Atomic does. My understanding is that
there are two parts to sequencing: compiler reordering and CPU reordering. From
Randy's response I gathered that Atomic guarantees no CPU reordering between it
and volatile objects (using fences or whatever mechanism is provided). If
Volatile has no effect on compiler reordering, then even if Atomic can put
fences between it and Volatile objects, the fence won't help if the compiler
reorders before that happens. So I know that the index operations are safe,
but I don't know whether any assignment before or after the indexing operation
is OK, since the buffer is only volatile.
In particular, I am basing it off of this response from Randy:
***************************************************
(partial quote)
Atomic implies sequential access for atomic AND volatile objects (note that
volatile does NOT imply that). See 9.10 if you dare. (And if you understand
9.10, please join the ARG, we need you. ;-)
In particular, Atomic implies that any needed memory fences are used.
(Again, volatile alone does not make such a requirement.) The idea is that
one uses atomic objects to protect larger volatile data structures.
***************************************************
So even if Atomic does use fences, if the compiler decides to reorder the
statements before the binary is created, it may not help.
Did I misunderstand his response? I might have.
It has been my own experience that GNAT will prevent "compiler" reordering on
operations on Volatile objects, but that is probably just a GNAT extension.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-10 23:21 ` Jere
@ 2017-05-11 0:47 ` Randy Brukardt
2017-05-13 20:11 ` Jere
0 siblings, 1 reply; 65+ messages in thread
From: Randy Brukardt @ 2017-05-11 0:47 UTC (permalink / raw)
"Jere" <jhb.chat@gmail.com> wrote in message
news:f441cc24-a41e-496e-970f-989a319a2a08@googlegroups.com...
> On Wednesday, May 10, 2017 at 3:13:27 AM UTC-4, Dmitry A. Kazakov wrote:
>> On 10/05/2017 02:51, Jere wrote:
...
> I may have misunderstood him, but from Randy's response, I gathered that
> Volatile has no effect on sequence, only Atomic does.
There are two "sequences" here, one is what the compiler is allowed to
reorder (which is only non-volatile things, remember that atomic implies
volatile), and one is what the CPU is allowed to reorder (which is anything
other than atomic, the compiler has to do whatever is needed to avoid atomic
reordering).
...
> In particular, I am basing it off of this response from Randy:
> ***************************************************
> (partial quote)
> Atomic implies sequential access for atomic AND volatile objects (note
> that
> volatile does NOT imply that). See 9.10 if you dare. (And if you
> understand
> 9.10, please join the ARG, we need you. ;-)
>
> In particular, Atomic implies that any needed memory fences are used.
> (Again, volatile alone does not make such a requirement.) The idea is that
> one uses atomic objects to protect larger volatile data structures.
> ***************************************************
>
> So even if Atomic does use fences, if the compiler decides to reorder the
> statements before the binary is created, it may not help.
>
> Did I misunderstand his response? I might have.
Yes. I didn't say anything about compiler reordering, because it's crystal
clear from C.6(20/5) and C.6(16/3) [and the earlier versions, too] that no
compiler reordering is allowed. [Indeed, C.6(20/5) was too strong,
preventing writing volatile bit-fields if taken literally (you have to do a
pre-read to do a bit-field write, but that would clearly be a "memory read
of a volatile object not specified by the program"), so it had to be
weakened.]
It's the effect on CPU reordering that isn't very clear; the AARM notes
C.6(16.a-d/3) are intended to clarify this, but not everyone reads the AARM.
Randy.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-11 0:47 ` Randy Brukardt
@ 2017-05-13 20:11 ` Jere
2017-05-15 22:33 ` Randy Brukardt
0 siblings, 1 reply; 65+ messages in thread
From: Jere @ 2017-05-13 20:11 UTC (permalink / raw)
On Wednesday, May 10, 2017 at 8:47:23 PM UTC-4, Randy Brukardt wrote:
> "Jere" wrote in message
> > On Wednesday, May 10, 2017 at 3:13:27 AM UTC-4, Dmitry A. Kazakov wrote:
> >> On 10/05/2017 02:51, Jere wrote:
> ...
> > I may have misunderstood him, but from Randy's response, I gathered that
> > Volatile has no effect on sequence, only Atomic does.
>
> There are two "sequences" here, one is what the compiler is allowed to
> reorder (which is only non-volatile things, remember that atomic implies
> volatile), and one is what the CPU is allowed to reorder (which is anything
> other than atomic, the compiler has to do whatever is needed to avoid atomic
> reordering).
> ...
> > In particular, I am basing it off of this response from Randy:
> > ***************************************************
> > (partial quote)
> > Atomic implies sequential access for atomic AND volatile objects (note
> > that
> > volatile does NOT imply that). See 9.10 if you dare. (And if you
> > understand
> > 9.10, please join the ARG, we need you. ;-)
> >
> > In particular, Atomic implies that any needed memory fences are used.
> > (Again, volatile alone does not make such a requirement.) The idea is that
> > one uses atomic objects to protect larger volatile data structures.
> > ***************************************************
> >
> > So even if Atomic does use fences, if the compiler decides to reorder the
> > statements before the binary is created, it may not help.
> >
> > Did I misunderstand his response? I might have.
>
> Yes. I didn't say anything about compiler reordering, because it's crystal
> clear from C.6(20/5) and C.6(16/3) [and the earlier versions, too] that no
> compiler reordering is allowed. [Indeed, C.6(20/5) was too strong,
> preventing writing volatile bit-fields if taken literally (you have to do a
> pre-read to do a bit-field write, but that would clearly be a "memory read
> of a volatile object not specified by the program"), so it had to be
> weakened.]
>
> It's the effect on CPU reordering that isn't very clear; the AARM notes
> C.6(16.a-d/3) are intended to clarify this, but not everyone reads the AARM.
>
> Randy.
Ok, that makes sense. I reread 1.1.3 on external effects (referenced from
C.6(20)), and I think that helps clarify a lot of this for me. Thanks!
I played around with it a bit more. I feel like the RM is very piece-meal with
how it defines things. I have to find a bunch of different sections to piece
together a full picture on how something is defined. Not that it is wrong in
any sense, but it definitely makes it challenging for someone who doesn't yet
"know" the standard. I'll get there in time. Still it's helpful that the RM is
in html format with links, though sometimes there are places lacking links to
definitions (again, to things that once you "know" the standard make sense, but
for someone new it can be a bit deep to wade through, especially if you are used
to different terminology from other language standards).
Again, thanks! Your responses helped clear it up.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-13 20:11 ` Jere
@ 2017-05-15 22:33 ` Randy Brukardt
0 siblings, 0 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-15 22:33 UTC (permalink / raw)
"Jere" <jhb.chat@gmail.com> wrote in message
news:a0357a77-39dd-4e24-8580-a71a9ca79531@googlegroups.com...
...
> I played around with it a bit more. I feel like the RM is very piece-meal
> with
> how it defines things. I have to find a bunch of different sections to
> piece
> together a full picture on how something is defined. Not that it is wrong
> in
> any sense, but it definitely makes it challenging for someone who doesn't
> yet
> "know" the standard. I'll get there in time. Still it's helpful that the
> RM is
> in html format with links, though sometimes there are places lacking links
> to
> definitions (again, to things that once you "know" the standard make
> sense, but
> for someone new it can be a bit deep to wade through, especially if you
> are used
> to different terminology from other language standards).
We intend that every defined term appear in the index, so I'd look there if
all else fails. (Certainly before starting a search.) The real danger point
is things (like "part") that are technical terms but don't seem like it.
BTW, Atomic/Volatile are messy as they're defined in a specialized needs
annex, so part of the definition is in the core language and part in the
annex. (It's hard to imagine an Ada compiler not implementing most of annex
C, though - I don't think that Atomic/Volatile should have been there, but it
would be messy to change.)
Randy.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-10 7:13 ` Dmitry A. Kazakov
2017-05-10 16:45 ` Robert Eachus
2017-05-10 23:21 ` Jere
@ 2017-05-10 23:30 ` Jere
2017-05-11 0:38 ` Randy Brukardt
3 siblings, 0 replies; 65+ messages in thread
From: Jere @ 2017-05-10 23:30 UTC (permalink / raw)
On Wednesday, May 10, 2017 at 3:13:27 AM UTC-4, Dmitry A. Kazakov wrote:
> On 10/05/2017 02:51, Jere wrote:
>
> > Is there a method besides Atomic? If I am implementing a generic FIFO (lock
> > free) and the FIFO elements are complex data types that may not be able to be
> > atomic, do I have any other options or are protected objects my only way out?
>
> But you don't need the elements to be atomic, only the index. If I
> understood Randy's explanation correctly, atomic access to the buffer
> index gives you the barrier between index and element accesses, which
> is all that is required.
Followup to my previous email:
Perhaps it is easier for me to explain what I think is happening:
Key
Type => Prevents Compiler Reordering => Prevents CPU Reordering
After reading the various responses I am under the impression that:
Volatile => NO => NO
Atomic => YES => YES
I had hoped it was
Volatile => YES => NO
Atomic => YES => YES
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-10 7:13 ` Dmitry A. Kazakov
` (2 preceding siblings ...)
2017-05-10 23:30 ` Jere
@ 2017-05-11 0:38 ` Randy Brukardt
3 siblings, 0 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-11 0:38 UTC (permalink / raw)
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:oeueil$knn$1@gioia.aioe.org...
> On 10/05/2017 02:51, Jere wrote:
>
>> Is there a method besides Atomic? If I am implementing a generic FIFO
>> (lock
>> free) and the FIFO elements are complex data types that may not be able
>> to be
>> atomic, do I have any other options or are protected objects my only way
>> out?
>
> But you don't need the elements to be atomic, only the index. If I
> understood Randy's explanation correctly, atomic access to the buffer
> index gives you the barrier between index and element accesses, which
> is all that is required.
Correct. The compiler can't motion (or let the CPU motion) volatile
reads/writes across an atomic read/write. Thus you can depend on the order
of volatile operations so long as an atomic operation is interleaved.
Randy.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-10 0:51 ` Jere
2017-05-10 5:25 ` J-P. Rosen
2017-05-10 7:13 ` Dmitry A. Kazakov
@ 2017-05-10 16:38 ` Jeffrey R. Carter
2017-05-10 23:40 ` Jere
2 siblings, 1 reply; 65+ messages in thread
From: Jeffrey R. Carter @ 2017-05-10 16:38 UTC (permalink / raw)
On 05/10/2017 02:51 AM, Jere wrote:
>
> Is there a method besides Atomic? If I am implementing a generic FIFO (lock
> free) and the FIFO elements are complex data types that may not be able to be
> atomic, do I have any other options or are protected objects my only way out?
If your indices are Atomic and your array has Volatile_Components, then I think
things will work the way you want them to. The important reference here is ARM
1.1.3, which says
* Any read or update of an atomic or volatile object is part of the program's
external interactions.
* An Ada compiler must produce a program with external interactions in the order
and timing specified by Ada's semantics.
Ada's semantics for a sequence of statements say the statements are executed in
order.
So, if you have in task A
A (I + 1) := Item;
I := I + 1;
Then from the outside an observer must see in order
1. Read I
2. Write A (I + 1)
3. Read I
4. Write I
If task B does
if I > 0 then
then B's "Read I" will also be something the external observer can see. It might
happen before 1., between 1. and 3., between 3. and 4., or after 4., but won't
happen during any of 1, 3, or 4. That guarantees that B will always get a valid
view of I.
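For concreteness, declarations under which this ordering argument applies might look like the following (a sketch; the bounds and element type are illustrative):

```ada
Max : constant := 16;

A : array (1 .. Max) of Integer;
pragma Volatile_Components (A);  --  element reads/writes are external effects

I : Natural := 0;
pragma Atomic (I);               --  index accesses are atomic (and volatile)
```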
Thanks to Holsti for pointing this out [through ARM C.6(20)]. My previous
understanding of this was incorrect.
--
Jeff Carter
"Monsieur Arthur King, who has the brain of a duck, you know."
Monty Python & the Holy Grail
09
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-10 16:38 ` Jeffrey R. Carter
@ 2017-05-10 23:40 ` Jere
0 siblings, 0 replies; 65+ messages in thread
From: Jere @ 2017-05-10 23:40 UTC (permalink / raw)
On Wednesday, May 10, 2017 at 12:38:54 PM UTC-4, Jeffrey R. Carter wrote:
> On 05/10/2017 02:51 AM, Jere wrote:
> >
> > Is there a method besides Atomic? If I am implementing a generic FIFO (lock
> > free) and the FIFO elements are complex data types that may not be able to be
> > atomic, do I have any other options or are protected objects my only way out?
>
> If your indices are Atomic and your array has Volatile_Components, then I think
> things will work the way you want them to. The important reference here is ARM
> 1.1.3, which says
>
> * Any read or update of an atomic or volatile object is part of the program's
> external interactions.
>
> * An Ada compiler must produce a program with external interactions in the order
> and timing specified by Ada's semantics.
>
> Ada's semantics for a sequence of statements say the statements are executed in
> order.
>
> So, if you have in task A
>
> A (I + 1) := Item;
> I := I + 1;
>
> Then from the outside an observer must see in order
>
> 1. Read I
> 2. Write A (I + 1)
> 3. Read I
> 4. Write I
>
> If task B does
>
> if I > 0 then
>
> then B's "Read I" will also be something the external observer can see. It might
> happen before 1., between 1. and 3., between 3. and 4., or after 4., but won't
> happen during any of 1, 3, or 4. That guarantees that B will always get a valid
> view of I.
Thanks!
My previous couple of emails might be moot now that I have read this response.
Here I get the impression that Volatile does prevent Compiler statement
reordering. If that is the case, then I know that it, combined with any
fencing put up by Atomic objects, will be sufficient.
So at this point my understanding is now:
Type => Prevents Compiler Reordering => Prevents CPU Reordering
Volatile => YES => NO
Atomic => YES => YES
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-09 19:57 ` Randy Brukardt
2017-05-10 0:51 ` Jere
@ 2017-05-10 16:19 ` Robert Eachus
2017-05-11 1:02 ` Randy Brukardt
1 sibling, 1 reply; 65+ messages in thread
From: Robert Eachus @ 2017-05-10 16:19 UTC (permalink / raw)
On Tuesday, May 9, 2017 at 3:57:50 PM UTC-4, Randy Brukardt wrote:
> "Robert Eachus" <rieachus@comcast.net> wrote in message
> news:077e7f6a-5a7b-4b88-a16f-7672aec18a17@googlegroups.com...
> ...
> >The compiler will play some games, and certainly will leave Write_Index in
> >a register if possible.
>
> It better not if Write_Index is Atomic.
First, if FIFO has Atomic_Components, then it is not possible. Atomic_Components would be overkill, but I haven't seen those declarations here. If FIFO has just Volatile_Components (or no attribute) then this is possible. But what is "this"?
>
Sloppy wording, I should have said "keep a copy of Write_Index". The write to Write_Index will occur to cache on modern systems where the whole system is cache-coherent. But that is a detail. What I was specifically referring to was that there are two subexpressions Write_Index+1. As far as I can tell, if there are no synchronization points between them, the overall system (including the compiler) can do one read of Write_Index, then evaluate the expression once--or twice.
Why evaluate twice? (Other than the write to FIFO being a synchronization point?) Hardware magic may combine the increment with the move, through a postincrement addressing mode. It would then write the incremented register copy to Write_Index. One read, one write. Same as if it did common subexpression elimination. It doesn't need to READ Write_Index twice. I hope I'm right, because anything else is silly. If a synchronization point did occur between the two reads, then the in-register copy of Write_Index (or Write_Index + 1, if the implementation wants to keep that around instead) must be spoiled.
What if Write_Address is assigned to a special address which other hardware can access? Doesn't matter. Think of the cache system as treating addresses as names, and moving them around. As long as everyone sees the same version of that name, the contents of the location named are the only reality we can be concerned with. Since CPUs now contain the (physical) memory managers, as well as cache, the "external effect" of a program has to be interpreted as the view of memory offered by the CPU. Nothing else is available. (Writes to disk, or to the display screen are all memory writes from the CPU's viewpoint.)
There is another issue here, which we need to allow compilers to get right. Assume that the records in FIFO are X bytes long. A common compiler optimization is to use the value FIFO'Address + X * Write_Index instead of Write_Index to do the writes. Is that legal? Or better, under which attributes of FIFO and Write_Index is the system allowed to do that optimization?
> Only the compiler vendor need worry about this level of discussion.
> Honestly, I doubt anyone else is qualified. (I'm not, and I AM a compiler
> vendor... ;-)
I used to be a compiler writer, on at least three Ada compilers, but the deep level understanding of what is going on comes from my work on radar systems. There access times in microseconds are important, and sometimes in nanoseconds. (I think the worst real-time requirement I ever saw was for 200 nanoseconds. Ouch!)
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-10 16:19 ` Robert Eachus
@ 2017-05-11 1:02 ` Randy Brukardt
2017-05-11 1:51 ` Robert Eachus
0 siblings, 1 reply; 65+ messages in thread
From: Randy Brukardt @ 2017-05-11 1:02 UTC (permalink / raw)
"Robert Eachus" <rieachus@comcast.net> wrote in message
news:df31d3f8-636f-4702-affe-d5a0aa0c898a@googlegroups.com...
On Tuesday, May 9, 2017 at 3:57:50 PM UTC-4, Randy Brukardt wrote:
> "Robert Eachus" <rieachus@comcast.net> wrote in message
> news:077e7f6a-5a7b-4b88-a16f-7672aec18a17@googlegroups.com...
> ...
>> >The compiler will play some games, and certainly will leave Write_Index
>> >in
>> >a register if possible.
>>
>> It better not if Write_Index is Atomic.
>
>First, if FIFO has Atomic_Components, then it is not possible.
>Atomic_Components would be
>overkill, but I haven't seen those declarations here. If FIFO has just
>Volatile_Components (or
>no attribute) then this is possible. But what is "this"?
????
Write_Index needs to be atomic in order for the OP's algorithm to work.
Atomic objects fall under C.6(20/5) -- all reads and updates are included in
the external effect of the program. Optimizing them out (in any way) is
wrong (excepting of course a true as-if optimization, but that doesn't make
any sense in this case). In particular, saving the value in a register is
useless, because you still have to read the object again at the next
reference.
>Sloppy wording, I should have said "keep a copy of Write_Index". The write
>to Write_Index
>will occur to cache on modern systems where the whole system is
>cache-coherent. But that is
>a detail. What I was specifically referring to was that there are two
>subexpressions
>Write_Index+1. As far as I can tell, if there are no synchronization
>points between them, the
>overall system (including the compiler) can do one read of Write_Index,
>then evaluate the
>expression once--or twice.
Definitely not (again assuming Write_Index is atomic). One has to perform the reads of memory for an atomic object; saving copies is wrong. And there is no real value to eliminating the +1 -- in-register operations are essentially free on modern architectures.
Since the rest of your discussion is based on an incorrect premise, I'm
going to skip it. (Other than to say, if you need the fastest possible code,
don't make anything volatile or atomic! They're for synchronization, which
necessarily is going to be slower than the usual code, since it has to be
predictable.)
...
>> Only the compiler vendor need worry about this level of discussion.
>> Honestly, I doubt anyone else is qualified. (I'm not, and I AM a compiler
>> vendor... ;-)
>
>I used to be a compiler writer, on at least three Ada compilers, but the
>deep level
>understanding of what is going on comes from my work on radar systems.
>There
>access times in microseconds are important, and sometimes in nanoseconds.
>(I think
>the worst real-time requirement I ever saw was for 200 nanoseconds. Ouch!)
I know that (and so am I), but my complaint was that the discussion isn't
helpful to the OP's question (which is how to do this in hopefully portable
Ada); it's the compiler's job to get this right and one hopes that the only
time anyone needs to worry about this is if they have a bug that they think
is caused by the compiler doing this wrong. (And this is when AdaCore
support contracts pay for themselves, because you can dump the problem on
them and let them figure it out. And it's the sort of problem that makes
those contracts so expensive, because it is HARD work!!)
Randy.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-11 1:02 ` Randy Brukardt
@ 2017-05-11 1:51 ` Robert Eachus
2017-05-15 22:45 ` Randy Brukardt
0 siblings, 1 reply; 65+ messages in thread
From: Robert Eachus @ 2017-05-11 1:51 UTC (permalink / raw)
On Wednesday, May 10, 2017 at 9:02:22 PM UTC-4, Randy Brukardt wrote:
> Write_Index needs to be atomic in order for the OP's algorithm to work.
> Atomic objects fall under C.6(20/5) -- all reads and updates are included in
> the external effect of the program. Optimizing them out (in any way) is
> wrong (excepting of course a true as-if optimization, but that doesn't make
> any sense in this case). In particular, saving the value in a register is
> useless, because you still have to read the object again at the next
> reference.
Even if there is no intervening read or write of anything? It seems to me that must be an "as if," since there is no way to prove otherwise. You may be thinking of reads of memory external to the CPU, part of some other hardware, where the other hardware can intervene.
Ah, the missing attribute: Uncacheable. Once upon a time, volatile implied that, but not for many years. There are still (embedded) processors out there where reading from a (memory-mapped I/O) location changes its value, so I guess that is an issue for Ada, but only when running on bare metal.
What about spin-locks? They have to have another action between reads to work. Well, you could pack a read, a test, and, say, a jump-if-non-zero into one clock cycle. But then there would be an extra read, since the test would be on the prior version of the flags. Put it into two clocks and the CPU can't be sure that the next read will occur. In other words, it has to do the right number of reads which can affect flags, and no more.
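For concreteness, here is a hedged sketch of a busy-wait loop on an Atomic flag, the simplest spin construct the paragraph above is talking about. All names are illustrative. Note that plain Atomic reads and writes are enough only to wait for a flag set by one other task; real mutual exclusion between competing acquirers needs a read-modify-write primitive (test-and-set or compare-and-swap), which portable Ada expresses through a protected object or vendor intrinsics:

```ada
--  Hedged sketch: waiting on an Atomic flag.  This is a wait-for-signal
--  loop, NOT a full mutual-exclusion lock (that would need test-and-set).
package Spin is
   Ready : Boolean := False with Atomic;
   procedure Wait_For_Ready;
end Spin;

package body Spin is
   procedure Wait_For_Ready is
   begin
      --  Because Ready is Atomic, every iteration's read is part of the
      --  external effect of the program (RM C.6): the compiler may not
      --  hoist the read out of the loop and spin on a register copy.
      while not Ready loop
         null;   --  busy-wait; a real system might yield or pause here
      end loop;
   end Wait_For_Ready;
end Spin;
```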
Hmm. You need to read the (x86 and most other) rules for prefetching memory locations. As I stated earlier in this thread, any CPU core is allowed to do a prefetch of any location, as long as it does not cause an action like an interrupt. This is necessary, or you wouldn't be able to have any Atomic accesses. The CPU reads an entire cache line, usually 64, 128, or 256 bytes into the cache, but even the widest of CPU writes are only 64 bytes wide, and those are multiple floating-point values.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-11 1:51 ` Robert Eachus
@ 2017-05-15 22:45 ` Randy Brukardt
0 siblings, 0 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-15 22:45 UTC (permalink / raw)
"Robert Eachus" <rieachus@comcast.net> wrote in message
news:fcf4a91a-bfdf-44ee-a888-7bb9a1c52b48@googlegroups.com...
On Wednesday, May 10, 2017 at 9:02:22 PM UTC-4, Randy Brukardt wrote:
>> Write_Index needs to be atomic in order for the OP's algorithm to work.
>> Atomic objects fall under C.6(20/5) -- all reads and updates are included
>> in
>> the external effect of the program. Optimizing them out (in any way) is
>> wrong (excepting of course a true as-if optimization, but that doesn't
>> make
>> any sense in this case). In particular, saving the value in a register is
>> useless, because you still have to read the object again at the next
>> reference.
>Even if there is no intervening read or write of anything?
Yes.
>It seems to me that must be an "as if," since there is no way to prove
>otherwise.
>You may be thinking of reads of memory external to the CPU, part of some
>other hardware, where the other hardware can intervene.
I'm not "thinking" that, the language defines it that way (it's part of the
external effect of the program, C.6(20) and 1.1.3(8-14)). I don't know how a
compiler could possibly know if someone has a way to see the external
effects (some CPU monitor, active memory, etc.), so one has to assume that
they CAN tell. And in any case, the intent of the language designers is
crystal-clear -- ignore that at your peril.
...
>Hmm. You need to read the (x86 and most other) rules for prefetching memory
>locations.
>As I stated earlier in this thread, any CPU core is allowed to do a
>prefetch of any location,
>as long as it does not cause an action like an interrupt. This is
>necessary, or you wouldn't
>be able to have any Atomic accesses. The CPU reads an entire cache line,
>usually 64, 128,
>or 256 bytes into the cache, but even the widest of CPU writes are only 64
>bytes wide, and
>those are multiple floating-point values.
I have the compiler generate the "correct" code and don't worry much about
what the CPU does to it. (That's one reason I was completely in favor of
removing the bogus requirement to read/write all the way to memory -- it
doesn't make sense on CPUs that do caching.) That's out of my hands in any
case as a software guy. (Refer to my response to Jere.) There's a
requirement to prevent CPU reordering, but nothing else.
But I wouldn't be implementing the language if I started ignoring the
requirements on the compiler by leaning on an as-if optimization, especially
one that can be detected with active hardware. Especially as, if I did so, there would be no possibility of ever writing portable synchronization
code...
Randy.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-08 16:22 ` Dmitry A. Kazakov
2017-05-08 18:39 ` Robert Eachus
2017-05-08 19:18 ` Robert Eachus
@ 2017-05-08 20:29 ` Niklas Holsti
2017-05-08 21:09 ` Dmitry A. Kazakov
2017-05-09 20:10 ` Randy Brukardt
2017-05-09 0:05 ` Jere
2017-05-09 19:53 ` Randy Brukardt
4 siblings, 2 replies; 65+ messages in thread
From: Niklas Holsti @ 2017-05-08 20:29 UTC (permalink / raw)
On 17-05-08 19:22 , Dmitry A. Kazakov wrote:
[snip]
> I don't see anything there what would prevent CPU from changing the
> order of accesses to two *independent* memory locations when only the
> program logic ties them nothing else. Well, except that there is no
> obvious reason why it would like to change that order.
I quote from an earlier post of mine in this thread:
Just as in C, volatile accesses in Ada are guaranteed to occur in the
order and number written in the source code and executed in the
"standard" order [moreover, this includes all tasks in the program]:
RM C.6 (16/3): All tasks of the program (on all processors) that read or
update volatile variables see the same order of updates to the
variables. A use of an atomic variable or other mechanism may be
necessary to avoid erroneous execution and to ensure that access to
nonatomic volatile variables is sequential (see 9.10).
> Yet the question stands. In particular, which Ada primitives may ensure
> safe implementations of lock-free structures when no protected objects
> used.
Pragma Volatile + RM C.6 (16/3)? If these are not sufficient, why not?
--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-08 20:29 ` Niklas Holsti
@ 2017-05-08 21:09 ` Dmitry A. Kazakov
2017-05-09 4:34 ` Niklas Holsti
2017-05-09 20:10 ` Randy Brukardt
1 sibling, 1 reply; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-08 21:09 UTC (permalink / raw)
On 2017-05-08 22:29, Niklas Holsti wrote:
> On 17-05-08 19:22 , Dmitry A. Kazakov wrote:
>
> [snip]
>
>> I don't see anything there what would prevent CPU from changing the
>> order of accesses to two *independent* memory locations when only the
>> program logic ties them nothing else. Well, except that there is no
>> obvious reason why it would like to change that order.
>
> I quote from an earlier post of mine in this thread:
>
> Just as in C, volatile accesses in Ada are guaranteed to occur in the
> order and number written in the source code and executed in the
> "standard" order [moreover, this includes all tasks in the program]:
>
> RM C.6 (16/3): All tasks of the program (on all processors) that read or
> update volatile variables see the same order of updates to the
> variables.
This is ambiguous. Does it read as:
A. Order of updates per volatile variable
B. Order of updates to the set of all volatile variables
B is sufficient, A is not. Depending on the implementation A may imply B
or not.
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-08 21:09 ` Dmitry A. Kazakov
@ 2017-05-09 4:34 ` Niklas Holsti
2017-05-09 6:16 ` Niklas Holsti
0 siblings, 1 reply; 65+ messages in thread
From: Niklas Holsti @ 2017-05-09 4:34 UTC (permalink / raw)
On 17-05-09 00:09 , Dmitry A. Kazakov wrote:
> On 2017-05-08 22:29, Niklas Holsti wrote:
>> On 17-05-08 19:22 , Dmitry A. Kazakov wrote:
>>
>> [snip]
>>
>>> I don't see anything there what would prevent CPU from changing the
>>> order of accesses to two *independent* memory locations when only the
>>> program logic ties them nothing else. Well, except that there is no
>>> obvious reason why it would like to change that order.
>>
>> I quote from an earlier post of mine in this thread:
>>
>> Just as in C, volatile accesses in Ada are guaranteed to occur in the
>> order and number written in the source code and executed in the
>> "standard" order [moreover, this includes all tasks in the program]:
>>
>> RM C.6 (16/3): All tasks of the program (on all processors) that read or
>> update volatile variables see the same order of updates to the
>> variables.
>
> This is ambiguous. Does it read as:
>
> A. Order of updates per volatile variable
>
> B. Order of updates to the set of all volatile variables
>
> B is sufficient, A is not. Depending on the implementation A may imply B
> or not.
To me it is clear that B is meant, because the text uses the plural
"variables".
If A were meant, the text would say "... that read or update a volatile
variable see the same order of updates to the variable".
Also, remember that accesses to volatile variables are defined to be
part of the "external effect" of the program, and all of the external
effect must occur in the canonical order defined by the Ada semantics,
without any reordering by compiler or processor.
--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-09 4:34 ` Niklas Holsti
@ 2017-05-09 6:16 ` Niklas Holsti
2017-05-09 8:34 ` Dmitry A. Kazakov
2017-05-09 20:18 ` Randy Brukardt
0 siblings, 2 replies; 65+ messages in thread
From: Niklas Holsti @ 2017-05-09 6:16 UTC (permalink / raw)
On 17-05-09 07:34 , Niklas Holsti wrote:
> On 17-05-09 00:09 , Dmitry A. Kazakov wrote:
>> On 2017-05-08 22:29, Niklas Holsti wrote:
>>> On 17-05-08 19:22 , Dmitry A. Kazakov wrote:
>>>
>>> [snip]
>>>
>>>> I don't see anything there what would prevent CPU from changing the
>>>> order of accesses to two *independent* memory locations when only the
>>>> program logic ties them nothing else. Well, except that there is no
>>>> obvious reason why it would like to change that order.
>>>
>>> I quote from an earlier post of mine in this thread:
>>>
>>> Just as in C, volatile accesses in Ada are guaranteed to occur in the
>>> order and number written in the source code and executed in the
>>> "standard" order [moreover, this includes all tasks in the program]:
>>>
>>> RM C.6 (16/3): All tasks of the program (on all processors) that read or
>>> update volatile variables see the same order of updates to the
>>> variables.
>>
>> This is ambiguous. Does it read as:
>>
>> A. Order of updates per volatile variable
>>
>> B. Order of updates to the set of all volatile variables
>>
>> B is sufficient, A is not. Depending on the implementation A may imply B
>> or not.
>
> To me it is clear that B is meant, because the text uses the plural
> "variables".
>
> If A were meant, the text would say "... that read or update a volatile
> variable see the same order of updates to the variable".
>
> Also, remember that accesses to volatile variables are defined to be
> part of the "external effect" of the program, and all of the external
> effect must occur in the canonical order defined by the Ada semantics,
> without any reordering by compiler or processor.
Hmm... there may be some support for interpretation A in the Annotated RM.
In the Annotated RM, the following notes follow C.6 (16/3) and IMO
support the view that this rule is meant to ensure a consistent access
order over all volatile variables, that is, interpretation B (note also
that some sentences speak of variables in plural, others speak of a
single variable, suggesting that the choice of plural or singular is
intentional and carries meaning):
- 16.a/3 Implementation Note: {AI05-0117-1} {AI05-0275-1} To ensure
this, on a multiprocessor, any read or update of an atomic object may
require the use of an appropriate memory barrier.
- 16.b/3 Discussion: {AI05-0275-1} From 9.10 it follows that (in
non-erroneous programs) accesses to variables, including those shared by
multiple tasks, are always sequential. This guarantees that no task will
ever see partial updates of any variable. For volatile variables
(including atomic variables), the above rule additionally specifies that
all tasks see the same order of updates.
- 16.c/3 {AI05-0275-1} If for a shared variable X, a read of X occurs
sequentially after an update of X, then the read will return the updated
value if X is volatile or atomic, but may or may not return the
updated value if X is nonvolatile. For nonvolatile accesses, a signaling
action is needed in order to share the updated value.
("Sequentially" above refers to RM 9.10, I believe.)
On the other hand, the final note on (16/3) in the ARM, below, speaks of
the order imposed by accesses to a _single_ atomic variable ("the same
atomic variable"), which may suggest interpretation A:
- 16.d/3 {AI05-0275-1} Because accesses to the same atomic variable by
different tasks establish a sequential order between the actions of
those tasks, implementations may be required to emit memory barriers
around such updates or use atomic instructions that imply such barriers.
It would be nice to hear from some ARG member re the correct
interpretation of RM C.6 (16/3), as this seems to define whether pragma
Volatile is enough to let us implement lock-free synchronisation routines.
--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-09 6:16 ` Niklas Holsti
@ 2017-05-09 8:34 ` Dmitry A. Kazakov
2017-05-09 20:18 ` Randy Brukardt
1 sibling, 0 replies; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-09 8:34 UTC (permalink / raw)
On 09/05/2017 08:16, Niklas Holsti wrote:
> On 17-05-09 07:34 , Niklas Holsti wrote:
>> On 17-05-09 00:09 , Dmitry A. Kazakov wrote:
>>> On 2017-05-08 22:29, Niklas Holsti wrote:
>>>> On 17-05-08 19:22 , Dmitry A. Kazakov wrote:
>>>>
>>>> [snip]
>>>>
>>>>> I don't see anything there what would prevent CPU from changing the
>>>>> order of accesses to two *independent* memory locations when only the
>>>>> program logic ties them nothing else. Well, except that there is no
>>>>> obvious reason why it would like to change that order.
>>>>
>>>> I quote from an earlier post of mine in this thread:
>>>>
>>>> Just as in C, volatile accesses in Ada are guaranteed to occur in the
>>>> order and number written in the source code and executed in the
>>>> "standard" order [moreover, this includes all tasks in the program]:
>>>>
>>>> RM C.6 (16/3): All tasks of the program (on all processors) that
>>>> read or
>>>> update volatile variables see the same order of updates to the
>>>> variables.
>>>
>>> This is ambiguous. Does it read as:
>>>
>>> A. Order of updates per volatile variable
>>>
>>> B. Order of updates to the set of all volatile variables
>>>
>>> B is sufficient, A is not. Depending on the implementation A may imply B
>>> or not.
>>
>> To me it is clear that B is meant, because the text uses the plural
>> "variables".
>>
>> If A were meant, the text would say "... that read or update a volatile
>> variable see the same order of updates to the variable".
>>
>> Also, remember that accesses to volatile variables are defined to be
>> part of the "external effect" of the program, and all of the external
>> effect must occur in the canonical order defined by the Ada semantics,
>> without any reordering by compiler or processor.
>
> Hmm... there may be some support for interpretation A in the Annotated RM.
>
> In the Annotated RM, the following notes follow C.6 (16/3) and IMO
> support the view that this rule is meant to ensure a consistent access
> order over all volatile variables, that is, interpretation B (note also
> that some sentences speak of variables in plural, others speak of a
> single variable, suggesting that the choice of plural or singular is
> intentional and carries meaning):
>
> - 16.a/3 Implementation Note: {AI05-0117-1} {AI05-0275-1} To ensure
> this, on a multiprocessor, any read or update of an atomic object may
> require the use of an appropriate memory barrier.
>
> - 16.b/3 Discussion: {AI05-0275-1} From 9.10 it follows that (in
> non-erroneous programs) accesses to variables, including those shared by
> multiple tasks, are always sequential. This guarantees that no task will
> ever see partial updates of any variable. For volatile variables
> (including atomic variables), the above rule additionally specifies that
> all tasks see the same order of updates.
>
> - 16.c/3 {AI05-0275-1} If for a shared variable X, a read of X occurs
> sequentially after an update of X, then the read will return the updated
> value if X is volatile or atomic, but may or may not return the
> updated value if X is nonvolatile. For nonvolatile accesses, a signaling
> action is needed in order to share the updated value.
>
> ("Sequentially" above refers to RM 9.10, I believe.)
>
> On the other hand, the final note on (16/3) in the ARM, below, speaks of
> the order imposed by accesses to a _single_ atomic variable ("the same
> atomic variable"), which may suggest interpretation A:
>
> - 16.d/3 {AI05-0275-1} Because accesses to the same atomic variable by
> different tasks establish a sequential order between the actions of
> those tasks, implementations may be required to emit memory barriers
> around such updates or use atomic instructions that imply such barriers.
>
> It would be nice to hear from some ARG member re the correct
> interpretation of RM C.6 (16/3), as this seems to define whether pragma
> Volatile is enough to let us implement lock-free synchronisation routines.
Interpretation B would mean that having pragma Atomic on the read and write indices should be sufficient for the FIFO implementation. And it would be portable.
P.S. It would be a great help to have a hard requirement in the ARM to support Atomic for all scalar types, e.g. Stream_Element_Offset.
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-09 6:16 ` Niklas Holsti
2017-05-09 8:34 ` Dmitry A. Kazakov
@ 2017-05-09 20:18 ` Randy Brukardt
1 sibling, 0 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-09 20:18 UTC (permalink / raw)
"Niklas Holsti" <niklas.holsti@tidorum.invalid> wrote in message
news:end561F9mocU1@mid.individual.net...
...
> It would be nice to hear from some ARG member re the correct
> interpretation of RM C.6 (16/3), as this seems to define whether pragma
> Volatile is enough to let us implement lock-free synchronisation routines.
(You shouldn't have these discussions while I'm on vacation. ;-)
No.
You miss the point about "non-erroneous programs" in the AARM note. Almost all accesses to shared variables by multiple tasks are erroneous, unless something makes them sequential. That something can be a rendezvous, a protected action, or an atomic object read/write. Without one of these, volatile guarantees nothing.
AARM C.6(16.c/3) was my attempt to explain the difference between volatile
and non-volatile, since in most instances there isn't one. (The main answer
is that a volatile object can be trusted in another task after an access to
an atomic object, and a regular object cannot; otherwise they're identical.)
AARM C.6(16.d/3) answers the OPs question: uses of an atomic object implies
barriers/fences if one is needed.
Note that volatile has a second, separate meaning that should never have been mixed in with the tasking one (that is, access to memory-mapped
hardware/software). That has different requirements at this sort of edge.
(There's a proposal on the table to split volatile into parts to clarify
which kind of use is intended.)
Randy.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-08 20:29 ` Niklas Holsti
2017-05-08 21:09 ` Dmitry A. Kazakov
@ 2017-05-09 20:10 ` Randy Brukardt
1 sibling, 0 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-09 20:10 UTC (permalink / raw)
"Niklas Holsti" <niklas.holsti@tidorum.invalid> wrote in message
news:enc2orF3oloU1@mid.individual.net...
...
>> Yet the question stands. In particular, which Ada primitives may ensure
>> safe implementations of lock-free structures when no protected objects
>> used.
>
> Pragma Volatile + RM C.6 (16/3)? If these are not sufficient, why not?
It's very subtle, but you need something atomic involved, else the rule for
volatile is meaningless. It took a long discussion by a (now former) AdaCore
person for me to understand it. Essentially, all shared object accesses (that is, from different tasks) are erroneous unless something makes the accesses "sequential" as defined in 9.10. Atomic makes accesses "sequential" (thus requiring "fences" on some CPUs); volatile does not.
The second sentence of C.6(16/3) is trying to point this out.
Since not everything can be atomic, you usually have to use some sort of
mixed atomic and volatile coding.
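A sketch of what such a mix of atomic and volatile declarations might look like (names, types, and sizes are illustrative, not from the OP's code):

```ada
--  Hedged sketch of the "mixed" pattern: the indices that carry the
--  inter-task signal are Atomic; the bulk data slots are merely
--  Volatile_Components, since they are read and written only between
--  the atomic index operations that order them.
package Mixed_FIFO is
   type Index is mod 16;
   type Element is new Integer;
   type Slot_Array is array (Index) of Element
     with Volatile_Components;

   FIFO        : Slot_Array;
   Write_Index : Index := 0 with Atomic;  --  producer writes, consumer reads
   Read_Index  : Index := 0 with Atomic;  --  consumer writes, producer reads
end Mixed_FIFO;
```

The design choice is that only the small scalar objects need the (potentially fence-generating) Atomic semantics; making every data slot Atomic would be overkill, as noted earlier in the thread.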
Again, as I've warned multiple times, this is untestable by the ACATS, so
you have to go totally on trust that the vendor has implemented it right.
Eachus's lengthy discussion might be interesting if you are implementing an
Ada compiler, but otherwise, just use the Ada tools provided and trust that
your vendor gets them right. There's really no other choice because the odds
of your getting it right when the vendor didn't is very low (as this stuff
is at the edge of human understanding).
Randy.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-08 16:22 ` Dmitry A. Kazakov
` (2 preceding siblings ...)
2017-05-08 20:29 ` Niklas Holsti
@ 2017-05-09 0:05 ` Jere
2017-05-09 8:26 ` Dmitry A. Kazakov
2017-05-09 19:53 ` Randy Brukardt
4 siblings, 1 reply; 65+ messages in thread
From: Jere @ 2017-05-09 0:05 UTC (permalink / raw)
On Monday, May 8, 2017 at 12:22:55 PM UTC-4, Dmitry A. Kazakov wrote:
> For practical use when the buffer size is > 1 there cannot be any problem.
I saw you mention this earlier. Why does the buffer size prevent that problem?
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-09 0:05 ` Jere
@ 2017-05-09 8:26 ` Dmitry A. Kazakov
0 siblings, 0 replies; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-09 8:26 UTC (permalink / raw)
On 09/05/2017 02:05, Jere wrote:
> On Monday, May 8, 2017 at 12:22:55 PM UTC-4, Dmitry A. Kazakov wrote:
>> For practical use when the buffer size is > 1 there cannot be any problem.
>
> I saw you mention this earlier. Why does the buffer size prevent that problem?
I have not analyzed all variants, so maybe I am wrong. The idea is that the modified index is used only for tests. When reading occurs at another index, the premature increment of the former does not matter.
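A sketch of the consumer side may make the argument concrete. Declarations and names are illustrative, matching no particular posted code; whether the reasoning actually holds in all interleavings is exactly what is being debated in this thread:

```ada
--  Hedged sketch of the consumer side of a one-producer, one-consumer
--  ring buffer.  Illustrative only.
package Consumer_Side is
   type Index is mod 16;              --  buffer size > 1
   type Element is new Integer;
   type Slot_Array is array (Index) of Element
     with Volatile_Components;

   FIFO        : Slot_Array;
   Write_Index : Index := 0 with Atomic;   --  written only by the producer
   Read_Index  : Index := 0 with Atomic;   --  written only by the consumer

   procedure Pop (Item : out Element; Got : out Boolean);
end Consumer_Side;

package body Consumer_Side is
   procedure Pop (Item : out Element; Got : out Boolean) is
   begin
      if Read_Index = Write_Index then
         Got := False;                --  empty
      else
         --  The argument above: Write_Index is used only in the test;
         --  the element actually read is FIFO (Read_Index), which with
         --  more than one slot is usually not the slot the producer is
         --  filling at this moment.
         Item := FIFO (Read_Index);
         Read_Index := Read_Index + 1;
         Got := True;
      end if;
   end Pop;
end Consumer_Side;
```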
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-08 16:22 ` Dmitry A. Kazakov
` (3 preceding siblings ...)
2017-05-09 0:05 ` Jere
@ 2017-05-09 19:53 ` Randy Brukardt
2017-05-09 20:27 ` Dmitry A. Kazakov
4 siblings, 1 reply; 65+ messages in thread
From: Randy Brukardt @ 2017-05-09 19:53 UTC (permalink / raw)
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:oeq60s$18c1$1@gioia.aioe.org...
> On 2017-05-08 17:56, Robert Eachus wrote:
>> One sentence summary: The CPU will play lots of tricks behind your
>> back, but in such a way you can pretend your code works as you intended.
>> (Er, after you get the bugs out.)
>>
>> On Monday, May 8, 2017 at 3:45:26 AM UTC-4, Dmitry A. Kazakov wrote:
>>
>>> I don't believe either of the sequences can get reordered by the CPU,
>>> but it is only belief and common sense.
>>
>> Silly rabbit, Trix are for kids. Oops, silly programmer, common
>> sense has nothing to do with modern CPUs.
> [...]
>> Is any of this important to a high-level language programmer? Not
>> really, unless they are working in hard real time. (Which I did.) If you
>> program in assembler, or more likely work on a compiler back-end,
>> knowing which instructions fit together is important. Even more
>> important is coding so that an instruction that references a new cache
>> line in memory is followed by as many instructions as possible that
>> don't use that value. Some times you run into cases where you want to
>> take advantage of write combining or some other trick involving the
>> write pipe--but today, smart compiler optimization is about accesses to
>> main memory and not much else.
>
> I don't see anything there what would prevent CPU from changing the order
> of accesses to two *independent* memory locations when only the program
> logic ties them nothing else. Well, except that there is no obvious reason
> why it would like to change that order.
>
> Wikipedia provides a more simple example of the problem:
>
> https://en.wikipedia.org/wiki/Memory_barrier
>
> The case of a lock-free FIFO is different because it involves index and
> array element.
>
> For practical use when the buffer size is > 1 there cannot be any problem.
>
> Yet the question stands. In particular, which Ada primitives may ensure
> safe implementations of lock-free structures when no protected objects
> used.
Atomic objects make such a guarantee. Nothing else does (well, ignoring
protected objects, of course, since you excluded them). End of story.
(Except that compilers might not implement it right, as I noted before.)
Randy.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-09 19:53 ` Randy Brukardt
@ 2017-05-09 20:27 ` Dmitry A. Kazakov
2017-05-11 0:35 ` Randy Brukardt
0 siblings, 1 reply; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-09 20:27 UTC (permalink / raw)
On 2017-05-09 21:53, Randy Brukardt wrote:
> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
>> Yet the question stands. In particular, which Ada primitives may ensure
>> safe implementations of lock-free structures when no protected objects
>> used.
>
> Atomic objects make such a guarentee. Nothing else does (well, ignoring
> protected objects, of course, since you excluded them). End of story.
Glad to hear that!
> (Except that compilers might not implement it right, as I noted before.)
So far I didn't experience problems with my implementations of lock-free
FIFO and blackboard compiled with GNAT for x86 and ARM7.
You are right in noting that this is not testable (OK, one could fully test it under an emulator in simulated time, but who tests the emulator?).
Yet, surprisingly, on real examples the error shows itself pretty
quickly. The background is that since GNAT does not support pragma
Atomic for 64-bit integers on 32-bit machines I have the implementation
varying, selected by GNAT project scenario. E.g. GCC built-in operations
are used instead of pragma Atomic. When the selected scenario is wrong
the error shows itself almost instantly.
So the conclusion is that GNAT seems OK.
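The 64-bit problem Dmitry alludes to can be illustrated outside Ada. A minimal C sketch, assuming the GCC `__atomic` builtins he mentions as his fallback (the function names here are illustrative, not part of any Ada API):

```c
#include <stdint.h>

/* On a 32-bit target, a plain 64-bit load or store compiles to two
 * 32-bit moves, so a concurrent reader can observe a torn value
 * (half old, half new).  The GCC __atomic builtins instead emit a
 * single atomic access (e.g. cmpxchg8b on x86) or fall back to a
 * library lock -- which is what one reaches for when pragma Atomic
 * is rejected for a 64-bit type. */
static uint64_t shared_counter;

uint64_t atomic_read64(void)
{
    return __atomic_load_n(&shared_counter, __ATOMIC_SEQ_CST);
}

void atomic_write64(uint64_t v)
{
    __atomic_store_n(&shared_counter, v, __ATOMIC_SEQ_CST);
}
```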
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-09 20:27 ` Dmitry A. Kazakov
@ 2017-05-11 0:35 ` Randy Brukardt
2017-05-11 8:24 ` Dmitry A. Kazakov
0 siblings, 1 reply; 65+ messages in thread
From: Randy Brukardt @ 2017-05-11 0:35 UTC (permalink / raw)
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:oet8mj$13ri$1@gioia.aioe.org...
> On 2017-05-09 21:53, Randy Brukardt wrote:
...
>> (Except that compilers might not implement it right, as I noted before.)
>
> So far I didn't experience problems with my implementations of lock-free
> FIFO and blackboard compiled with GNAT for x86 and ARM7.
>
> You are right noting that this is not testable (OK, one could fully test
> it under an emulator in simulated time, but who tests the emulator?).
>
> Yet, surprisingly, on real examples the error shows itself pretty quickly.
> The background is that since GNAT does not support pragma Atomic for
> 64-bit integers on 32-bit machines I have the implementation varying,
> selected by GNAT project scenario. E.g. GCC built-in operations are used
> instead of pragma Atomic. When the selected scenario is wrong the error
> shows itself almost instantly.
That doesn't surprise me that much. If you have a lock-free algorithm that
gets used frequently, you're probably going to hit all of the possible
states reasonably soon (assuming thousands/millions of operations get
tested). The problem of course is that failure just causes erroneous
execution, so formally one can't depend on anything that happens -- which of
course includes displaying a failure message.
Still, it would be nice to try something in this area. One could imagine
creating an Annex C test that implemented a lock-free algorithm and just
tried to see if it worked properly. That certainly wouldn't catch
everything, but it would catch some gross errors and at least put a
spotlight on the rules in question. But that isn't my area of expertise;
doing it wrong is definitely worse than not doing it at all (because of the
time investment to issue, and then withdraw, such a test after it is
challenged).
If someone submitted a test in this area (hint, hint!) I definitely would be
interested.
Randy.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-11 0:35 ` Randy Brukardt
@ 2017-05-11 8:24 ` Dmitry A. Kazakov
2017-05-15 22:53 ` Randy Brukardt
0 siblings, 1 reply; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-11 8:24 UTC (permalink / raw)
On 11/05/2017 02:35, Randy Brukardt wrote:
> Still, it would be nice to try something in this area. One could imagine
> creating an Annex C test that implemented a lock-free algorithm and just
> tried to see if it worked properly.
The real application used socket communication. The problem, I suspect, is
that artificial tests would not show the problems a full-size application
has. Maybe a better strategy would be specific low-level tests for the
atomic objects used in lock-free algorithms.
And we really should extend Annex C with some basic atomic operations
like atomic increment, test-then-increment, test-then-exchange, etc., all
falling back to protected objects if the machine lacks the corresponding
instructions.
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-11 8:24 ` Dmitry A. Kazakov
@ 2017-05-15 22:53 ` Randy Brukardt
2017-05-18 17:44 ` Dmitry A. Kazakov
0 siblings, 1 reply; 65+ messages in thread
From: Randy Brukardt @ 2017-05-15 22:53 UTC (permalink / raw)
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:of174q$13o7$1@gioia.aioe.org...
> On 11/05/2017 02:35, Randy Brukardt wrote:
>
>> Still, it would be nice to try something in this area. One could imagine
>> creating an Annex C test that implemented a lock-free algorithm and just
>> tried to see if it worked properly.
>
> The real application used socket communication. The problem I suspect is
> that artificial tests would not show problems full-size application have.
> Maybe, a better strategy could be specific low-level tests for atomic
> objects used in lock-free algorithms.
That's what I was thinking. But I'm not really the person to propose such
tests, I know just enough to be dangerous. ;-) I'd be much better served
being just an editor on such tests, as I wouldn't have too much trouble
seeing things that need to be more portable (I have a lot of experience
doing that).
> And we really should extend the Annex C with some basic atomic operations
> like atomic increment, test-then-increment, test-then-exchange etc, all
> falling back to protected objects if the machine lacks corresponding
> instructions.
I think you are right about that. Supposedly, implementations are supposed
to provide access to such things via machine code insertions, but that is
never going to be portable (even to other implementations for the same
target). So that's not really helpful.
Perhaps you could send a problem statement and brief proposal to
Ada-Comment? There's still time to make suggestions for Ada 2020, and
tasking issues of all kinds are right in the wheelhouse of what we're hoping
to accomplish. (As always, the ARG can find ways to solve problems, the real
issue is knowing what the problems are. I believe that you are saying that
creating portable lock-free algorithms is harder than necessary because of
limitations in what can be atomic and what can be safely done with atomic
objects. That seems like an important problem for the ARG to spend some time
on.)
Randy.
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-15 22:53 ` Randy Brukardt
@ 2017-05-18 17:44 ` Dmitry A. Kazakov
2017-05-18 21:01 ` Randy Brukardt
0 siblings, 1 reply; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-18 17:44 UTC (permalink / raw)
On 16/05/2017 00:53, Randy Brukardt wrote:
> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
> news:of174q$13o7$1@gioia.aioe.org...
>> On 11/05/2017 02:35, Randy Brukardt wrote:
>>
>>> Still, it would be nice to try something in this area. One could imagine
>>> creating an Annex C test that implemented a lock-free algorithm and just
>>> tried to see if it worked properly.
>>
>> The real application used socket communication. The problem I suspect is
>> that artificial tests would not show problems full-size application have.
>> Maybe, a better strategy could be specific low-level tests for atomic
>> objects used in lock-free algorithms.
>
> That's what I was thinking. But I'm not really the person to propose such
> tests, I know just enough to be dangerous. ;-) I'd be much better served
> being just an editor on such tests, as I wouldn't have too much trouble
> seeing things that need to be more portable (I have a lot of experience
> doing that).
>
>> And we really should extend the Annex C with some basic atomic operations
>> like atomic increment, test-then-increment, test-then-exchange etc, all
>> falling back to protected objects if the machine lacks corresponding
>> instructions.
>
> I think you are right about that. Supposedly, implementations are supposed
> to provide access to such things via machine code insertions, but that is
> never going to be portable (even to other implementations for the same
> target). So that's not really helpful.
>
> Perhaps you could to send a problem statement and brief proposal to
> Ada-Comment? There's still time to make suggestions for Ada 2020, and
> tasking issues of all kinds are right in the wheelhouse of what we're hoping
> to accomplish. (As always, the ARG can find ways to solve problems, the real
> issue is knowing what the problems are. I believe that you are saying that
> creating portable lock-free algorithms are harder than necessary because of
> limitations in what can be atomic and what can be safely done with atomic
> objects. That seems like an important problem for the ARG to spend some time
> on.)
I was thinking about something slightly higher-level than tedious CAS,
e.g. a predefined generic package:
generic
   type Value_Type is (<>);
   Initial_Value : Value_Type'Base;
package Atomic_Object is
   --
   -- Holder_Type -- Type of the atomic object keeping a Value_Type'Base
   --
   type Holder_Type is private;
   pragma Atomic (Holder_Type); -- It is a private type,
                                -- so it must be safe to assign
   --
   -- Get the actual value
   --
   function Load (Item : Holder_Type) return Value_Type'Base;
   --
   -- Set a new value
   --
   procedure Store (Item      : in out Holder_Type;
                    New_Value : Value_Type'Base);
   procedure Store (Item      : in out Holder_Type;
                    New_Value : Value_Type'Base;
                    Old_Value : out Value_Type'Base);
   --
   -- Compare the old value to the barrier and, if successful,
   -- store the new value
   --
   procedure Store_If_Greater (Item      : in out Holder_Type;
                               Barrier   : Value_Type'Base;
                               New_Value : Value_Type'Base);
   procedure Store_If_Greater (Item      : in out Holder_Type;
                               Barrier   : Value_Type'Base;
                               New_Value : Value_Type'Base;
                               Old_Value : out Value_Type'Base);
   procedure Store_If_Equal ...
   procedure Store_If_Inequal ...
   procedure Store_If_Less ...
   --
   -- Increment the value
   --
   procedure Increment (Item      : in out Holder_Type;
                        Increment : Value_Type'Base);
   procedure Increment (Item      : in out Holder_Type;
                        Increment : Value_Type'Base;
                        Old_Value : out Value_Type'Base);
   --
   -- Compare the old value to the barrier and, if successful, increment
   --
   procedure Increment_If_Greater (Item      : in out Holder_Type;
                                   Barrier   : Value_Type'Base;
                                   Increment : Value_Type'Base);
   procedure Increment_If_Greater (Item      : in out Holder_Type;
                                   Barrier   : Value_Type'Base;
                                   Increment : Value_Type'Base;
                                   Old_Value : out Value_Type'Base);
   procedure Increment_If_Equal ...
   procedure Increment_If_Inequal ...
   procedure Increment_If_Less ...
end Atomic_Object;
The implementation should be free to use machine instructions, a
spin-lock, or a protected object.
Example: reference counted object:
type Count_Type is mod 2**32;
package Atomic_Counters is new Atomic_Object (Count_Type, 0);

type Reference_Counted is record
   Use_Count : Atomic_Counters.Holder_Type;
   ...
end record;

type Reference_Counted_Ptr is access all Reference_Counted;
procedure Free is new Ada.Unchecked_Deallocation (Reference_Counted,
                                                  Reference_Counted_Ptr);
type Handle is new Ada.Finalization.Controlled with record
   Target : Reference_Counted_Ptr;
end record;

procedure Adjust (Pointer : in out Handle) is
begin -- Could check overflow using Increment_If_Less
   if Pointer.Target /= null then
      Atomic_Counters.Increment (Pointer.Target.Use_Count, 1);
   end if;
end Adjust;

procedure Finalize (Pointer : in out Handle) is
   Old_Value : Count_Type;
begin
   if Pointer.Target /= null then
      Atomic_Counters.Increment_If_Greater
      (  Item      => Pointer.Target.Use_Count,
         Barrier   => 0,
         Increment => 0 - 1, -- Decrement by 1 (modular arithmetic)
         Old_Value => Old_Value
      );
      case Old_Value is
         when 0 =>      -- Use count is unexpectedly 0
            raise Program_Error;
         when 1 =>      -- Going to destroy the target
            Free (Pointer.Target);
         when others =>
            null;
      end case;
   end if;
end Finalize;
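For what it's worth, the core of that Finalize can be sketched in C: Increment_If_Greater with Barrier => 0 and Increment => -1 is a "decrement unless already zero", which a vendor might implement as a CAS retry loop (a hedged sketch, assuming the GCC `__atomic` builtins; the function name is mine):

```c
#include <stdint.h>
#include <stdbool.h>

/* Decrement the use count only while it is above zero; returns the old
 * value, so the caller frees the object when the count drops from 1 to 0
 * and detects a bogus double-release when it was already 0. */
uint32_t decrement_if_positive(uint32_t *use_count)
{
    uint32_t old = __atomic_load_n(use_count, __ATOMIC_SEQ_CST);
    while (old > 0) {
        if (__atomic_compare_exchange_n(use_count, &old, old - 1,
                                        false, __ATOMIC_SEQ_CST,
                                        __ATOMIC_SEQ_CST))
            break;  /* swapped; 'old' holds the pre-decrement value */
        /* on failure 'old' was reloaded with the current value; retry */
    }
    return old;
}
```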
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-18 17:44 ` Dmitry A. Kazakov
@ 2017-05-18 21:01 ` Randy Brukardt
2017-05-19 7:54 ` Dmitry A. Kazakov
0 siblings, 1 reply; 65+ messages in thread
From: Randy Brukardt @ 2017-05-18 21:01 UTC (permalink / raw)
As I always repeat, the ARG is looking more for problems than for solutions
(we're plenty inventive in that regard). The most useful thing is a problem
statement on the lines of "I can't write a portable <something> algorithm
because Ada doesn't provide any portable <blah> operations". Along with an
explanation of the <something> algorithm.
It's fairly obvious to me that, since Atomic only protects a read or a
write but not both, classic operations like test-and-set can't be
(directly) written using it. Some algorithms can be constructed without
test-and-set,
but plenty of others can't be.
I'd think that the feature in question needs to be as low-level as possible
so that it can be implemented as efficiently as possible (if a protected
object is needed to implement it, there is no value to the low-level feature
as anyone can -- and should -- use a PO if that is possible).
I'd have no idea how to implement your generic without resorting to a PO,
which would defeat the purpose. (We recently had this discussion wrt
Suspension_Objects -- they can always be implemented with a PO, but the
entire point was that they ought to be simpler than that. Thus we didn't
make any additional requirements on them and in fact clarified that only one
task is supposed to be calling Suspend_Until_True. The same sort of dynamic
seems to be in effect here.)
Anyway, as I said, I know just enough to be dangerous here. You clearly know
more, so send some comments to Ada-Comment to get a discussion started. Else
it will never make the agenda...and then it's pretty certain that no changes
will happen!
Randy.
"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
news:ofkmin$1iog$1@gioia.aioe.org...
> On 16/05/2017 00:53, Randy Brukardt wrote:
>> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
>> news:of174q$13o7$1@gioia.aioe.org...
>>> On 11/05/2017 02:35, Randy Brukardt wrote:
>>>
>>>> Still, it would be nice to try something in this area. One could
>>>> imagine
>>>> creating an Annex C test that implemented a lock-free algorithm and
>>>> just
>>>> tried to see if it worked properly.
>>>
>>> The real application used socket communication. The problem I suspect is
>>> that artificial tests would not show problems full-size application
>>> have.
>>> Maybe, a better strategy could be specific low-level tests for atomic
>>> objects used in lock-free algorithms.
>>
>> That's what I was thinking. But I'm not really the person to propose such
>> tests, I know just enough to be dangerous. ;-) I'd be much better served
>> being just an editor on such tests, as I wouldn't have too much trouble
>> seeing things that need to be more portable (I have a lot of experience
>> doing that).
>>
>>> And we really should extend the Annex C with some basic atomic
>>> operations
>>> like atomic increment, test-then-increment, test-then-exchange etc, all
>>> falling back to protected objects if the machine lacks corresponding
>>> instructions.
>>
>> I think you are right about that. Supposedly, implementations are
>> supposed
>> to provide access to such things via machine code insertions, but that is
>> never going to be portable (even to other implementations for the same
>> target). So that's not really helpful.
>>
>> Perhaps you could send a problem statement and brief proposal to
>> Ada-Comment? There's still time to make suggestions for Ada 2020, and
>> tasking issues of all kinds are right in the wheelhouse of what we're
>> hoping
>> to accomplish. (As always, the ARG can find ways to solve problems, the
>> real
>> issue is knowing what the problems are. I believe that you are saying
>> that
>> creating portable lock-free algorithms are harder than necessary because
>> of
>> limitations in what can be atomic and what can be safely done with atomic
>> objects. That seems like an important problem for the ARG to spend some
>> time
>> on.)
>
> I was thinking about something slightly higher level than tedious CAS.
> E.g. predefined generic package:
>
> generic
> type Value_Type is (<>);
> Initial_Value : Value_Type'Base;
> package Atomic_Object is
> --
> -- Holder_Type -- Type of the atomic object to keep Value_Type'Base
> --
> type Holder_Type is private;
> pragma Atomic (Holder_Type); -- It is a private type,
> -- so it must be safe to assign
> --
> -- Get actual value
> --
> function Load (Item : Holder_Type) return Value_Type'Base;
> --
> -- Set new value
> --
> procedure Store (Item : in out Holder_Type;
> New_Value : Value_Type'Base);
> procedure Store (Item : in out Holder;
> New_Value : Value_Type'Base;
> Old_Value : out Value_Type'Base);
> --
> -- Compare old value to the barrier and if successful
> -- store new value
> --
> procedure Store_If_Greater (Item : in out Holder_Type;
> Barrier : Value_Type'Base;
> New_Value : Value_Type'Base);
> procedure Store_If_Greater (Item : in out Holder_Type;
> Barrier : Value_Type'Base;
> New_Value : Value_Type'Base;
> Old_Value : out Value_Type'Base);
> procedure Store_If_Equal ...
> procedure Store_If_Inequal ...
> procedure Store_If_Less ...
> procedure Store_If_Greater ...
> --
> -- Increment value
> --
> procedure Increment (Item : in out Holder;
> Increment : Value_Type'Base);
> procedure Increment (Item : in out Holder;
> Increment : Value_Type'Base;
> Old_Value : out Value_Type'Base);
> --
> -- Compare old value to the barrier and if successful increment
> --
> procedure Increment_If_Greater (Item : in out Holder_Type;
> Barrier : Value_Type'Base;
> Increment : Value_Type'Base);
> procedure Increment_If_Greater (Item : in out Holder_Type;
> Barrier : Value_Type'Base;
> Increment : Value_Type'Base;
> Old_Value : out Value_Type'Base);
> procedure Increment_If_Equal ...
> procedure Increment_If_Inequal ...
> procedure Increment_If_Less ...
> procedure Increment_If_Greater ...
>
> end Atomic_Object;
>
> The implementation should be free to use machine instructions, a
> spin-lock, or a protected object.
>
> Example: reference counted object:
>
> type Count_Type is mod 2**32;
> package Atomic_Counters is new Atomic_Object (Count_Type, 0);
>
> type Reference_Counted is record
> Use_Count : Atomic_Counters.Holder_Type;
> ...
> end record;
>
> type Reference_Counted_Ptr is access all Reference_Counted;
> procedure Free is new Ada.Unchecked_Deallocation (Reference_Counted,
> Reference_Counted_Ptr);
> type Handle is new Ada.Finalization.Controlled with record
> Target : Reference_Counted_Ptr;
> end record;
>
> procedure Adjust (Pointer : in out Handle) is
> begin -- Could check overflows using Increment_If_Less
> Atomic_Counters.Increment (Pointer.Target.Use_Count, 1);
> end Adjust;
>
> procedure Finalize (Pointer : in out Handle) is
> Old_Value : Count_Type;
> begin
> if Pointer.Target /= null then
> Atomic_Counters.Increment_If_Greater
> ( Item => Pointer.Target.Use_Count,
> Barrier => 0,
> Increment => 0 - 1, -- Decrement by 1
> Old_Value => Old_Value
> );
> case Old_Value is
> when 0 => -- Use count is unexpectedly 0
> raise Program_Error;
> when 1 => -- Going to destroy the target
> Free (Pointer.Target);
> when others =>
> null;
> end case;
> end if;
> end Finalize;
>
> --
> Regards,
> Dmitry A. Kazakov
> http://www.dmitry-kazakov.de
^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: Portable memory barrier?
2017-05-18 21:01 ` Randy Brukardt
@ 2017-05-19 7:54 ` Dmitry A. Kazakov
0 siblings, 0 replies; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-19 7:54 UTC (permalink / raw)
On 18/05/2017 23:01, Randy Brukardt wrote:
> As I always repeat, the ARG is looking more for problems than for solutions
> (we're plenty inventive in that regard). The most useful thing is a problem
> statement on the lines of "I can't write a portable <something> algorithm
> because Ada doesn't provide any portable <blah> operations". Along with an
> explanation of the <something> algorithm.
>
> It's fairly obvious to me that since Atomic only protects a read or a write,
> but not both that classic operations like test-and-set can't be (directly)
> written using it. Some algorithms can be constructed without test-and-set,
> but plenty of others can't be.
AFAIK, only a FIFO can rely solely on Load and Store. Almost everything
else requires a test and/or the old value in some form.
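To make the test-and-set point concrete: an Atomic read followed by an Atomic write is not equivalent, because another task can slip in between the two accesses. A hedged C sketch of a spinlock built on the real thing, using the GCC builtin (illustrative only, not an Ada mapping):

```c
/* __atomic_test_and_set atomically sets the flag and returns its
 * previous state in one indivisible step -- exactly what cannot be
 * composed from a separate atomic load and atomic store. */
static char lock_byte;

void spin_lock(void)
{
    while (__atomic_test_and_set(&lock_byte, __ATOMIC_ACQUIRE))
        ;  /* busy-wait until the previous owner clears the flag */
}

void spin_unlock(void)
{
    __atomic_clear(&lock_byte, __ATOMIC_RELEASE);
}
```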
> I'd think that the feature in question needs to be a low-level as possible
> so that it can be implemented as efficiently as possible (if a protected
> object is needed to implement it, there is no value to the low-level feature
> as anyone can -- and should -- use a PO if that is possible).
There is no value if it is too low-level. You don't get it portable
anyway, in the sense that there is no guarantee that each target has an
instruction for it -- not even CAS (Store_If_Equal in the generic
package outline).
Note that some higher-level operations are implementable with CAS. E.g.
atomic increment.
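As a hedged illustration of that claim, here is atomic increment built from nothing but CAS, as a vendor might do on a machine that has compare-and-swap but no fetch-and-add (a C sketch using the GCC builtins; the function name is mine):

```c
#include <stdint.h>
#include <stdbool.h>

/* Atomic increment via a CAS retry loop: read the current value, then
 * keep attempting to swap in (old + by) until no other task modified
 * the object between the read and the swap. */
uint32_t increment_via_cas(uint32_t *obj, uint32_t by)
{
    uint32_t old = __atomic_load_n(obj, __ATOMIC_RELAXED);
    while (!__atomic_compare_exchange_n(obj, &old, old + by,
                                        false, __ATOMIC_SEQ_CST,
                                        __ATOMIC_SEQ_CST))
        ;  /* on failure 'old' is reloaded with the current value */
    return old;  /* the old value, as the Old_Value variant would return */
}
```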
> I'd have no idea how to implement your generic without resorting to a PO,
> which would defeat the purpose.
Not at all. The programmer has no idea which would be the more efficient
way to implement atomic increment on the given machine:
1. CAS in a loop
2. an OS call (e.g. under Windows)
3. a built-in operation (e.g. in GCC)
4. a protected object
Even if he knows, the choice is not portable -- except for the PO, which
you already declared a non-starter. The compiler vendor can make the
right choice, and the programmer simply does not care what goes on behind
the scenes.
> (We recently had this discussion wrt
> Suspension_Objects -- they can always be implemented with a PO, but the
> entire point was that they ought to be simpler than that. Thus we didn't
> make any additional requirements on them and in fact clarified that only one
> task is supposed to be calling Suspend_Until_True. The same sort of dynamic
> seems to be in effect here.)
[OT
IMO raw suspension objects have no use whatsoever. What is really
required is an entry interface for tagged types, so that you could use
timed entry calls in the interfaces rather than passing a timeout
parameter around and handling exceptions later:
select
   Driver.Queue (IO_Request);
or
   delay 0.001; -- The device is busy
end select;
/OT]
> Anyway, as I said, I know just enough to be dangerous here. You clearly know
> more, so send some comments to Ada-Comment to get a discussion started. Else
> it will never make the agenda...and then its pretty certain that no changes
> will happen!
I hoped to get some input from c.l.a. Maybe nobody cares, or maybe some
have other and better ideas, use cases etc.
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
^ permalink raw reply [flat|nested] 65+ messages in thread