comp.lang.ada
* Portable memory barrier?
@ 2017-05-06  2:23 Jere
  2017-05-06  8:47 ` Jeffrey R. Carter
  2017-05-07 20:18 ` Robert Eachus
  0 siblings, 2 replies; 65+ messages in thread
From: Jere @ 2017-05-06  2:23 UTC (permalink / raw)


So I am implementing a circular buffer, which means I don't need to use a protected object between the Consumer and Producer (assuming 1 each ... Consumer adjusts one index, Producer adjusts another), but I have run into one area where CPU architecture could get me.

In one procedure (Producer's Put procedure), I run into the situation where I need to update the buffer, then the index (in that order) to ensure that the consumer doesn't see the index change before the data is actually there.

From a compiler optimization perspective, I believe I can tackle this with the Volatile pragma/aspect.  Assuming it works like it does in C, volatile variables have to have their accesses in sequence.  I think the RM also reflects this [0]?

However, I know some CPUs can further reorder instructions.  I know those specific processors usually have something like a memory barrier to force one write before the other, but I wanted to avoid doing inline assembly.  

Are there any Ada language constructs that can help?  I don't mind using protected objects, but I don't know if that guarantees me a memory barrier for CPU's that support that (or whatever means an architecture supplies).  I don't, however, want to use a protected object between the Consumer and the Producer so that one doesn't block out the other.

If a language portable construct does not exist, does a GNAT one exist?  I've been searching around tonight but haven't run into anything yet in the Ada front.  For GNAT I think Pragma Enable_Atomic_Synchronization [1] might work (assuming I use the Atomic pragma on a correctly sized variable)?

[0] (C.6 16/3) http://www.ada-auth.org/standards/12rm/html/RM-C-6.html
[1] https://gcc.gnu.org/onlinedocs/gnat_rm/Pragma-Enable_005fAtomic_005fSynchronization.html
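To make the intent concrete, here is a minimal single-producer/single-consumer sketch in C++ terms (C++ atomics standing in for whatever Ada mechanism applies; all names are illustrative, not from any real library). The release store on the put index is exactly the "buffer first, then index" ordering described above:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Illustrative single-producer/single-consumer ring buffer.
// The release store on put_index publishes the data write: the
// consumer's acquire load of put_index cannot observe the new index
// without also observing the item written before it.
template <typename T, std::size_t N>
class SpscBuffer {
    std::array<T, N + 1> items{};              // one slot kept empty
    std::atomic<std::size_t> put_index{0};     // written by producer only
    std::atomic<std::size_t> get_index{0};     // written by consumer only

public:
    bool put(const T& value) {                 // producer side
        std::size_t p    = put_index.load(std::memory_order_relaxed);
        std::size_t next = (p + 1) % (N + 1);
        if (next == get_index.load(std::memory_order_acquire))
            return false;                      // buffer full
        items[p] = value;                      // 1. write the data...
        put_index.store(next, std::memory_order_release); // 2. ...then the index
        return true;
    }

    std::optional<T> get() {                   // consumer side
        std::size_t g = get_index.load(std::memory_order_relaxed);
        if (g == put_index.load(std::memory_order_acquire))
            return std::nullopt;               // buffer empty
        T value = items[g];                    // data guaranteed visible here
        get_index.store((g + 1) % (N + 1), std::memory_order_release);
        return value;
    }
};
```

The release/acquire pair is the ordering being asked about: the other task can never see the index change before the data is actually there.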

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-06  2:23 Portable memory barrier? Jere
@ 2017-05-06  8:47 ` Jeffrey R. Carter
  2017-05-06 14:17   ` Jere
  2017-05-07 20:18 ` Robert Eachus
  1 sibling, 1 reply; 65+ messages in thread
From: Jeffrey R. Carter @ 2017-05-06  8:47 UTC (permalink / raw)


On 05/06/2017 04:23 AM, Jere wrote:
>
> In one procedure (Producer's Put procedure), I run into the situation where I
> need to update the buffer, then the index (in that order) to ensure that the
> consumer doesn't see the index change before the data is actually there.
>
> From a compiler optimization perspective, I believe I can tackle this with
> the Volatile pragma/aspect.  Assuming it works like it does in C, volatile
> variables have to have their accesses in sequence.  I think the RM also
> reflects this [0]?

Volatile means that something other than the program may change the value. It 
also means that all tasks of the program see the same sequence of updates. It 
does not mean that accesses are non-overlapping. If task A does

I := I + 1;

and task B does

if I > 0 then

with I marked Volatile, then A may update half of I, B read the partly updated 
value, and then A complete its update.

Atomic means Volatile plus all accesses are sequential; in the example, if A 
begins to update I, B won't read it until A's update is complete.

For your purposes, marking the (type of the) indices as Atomic should be all you 
need.
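In C++ terms (a stand-in, since the same distinction exists there; the names are invented for illustration), the difference between a torn volatile update and an atomic one looks roughly like this:

```cpp
#include <atomic>

// Illustrative counter shared by two tasks.
std::atomic<int> shared_i{0};

void task_a() {
    // An indivisible read-modify-write: no other task can observe a
    // half-finished update, unlike a plain volatile I := I + 1.
    shared_i.fetch_add(1);
}

bool task_b() {
    // Sees the value either before or after task_a's update,
    // never a torn intermediate value.
    return shared_i.load() > 0;
}
```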

-- 
Jeff Carter
"[M]any were collected near them, ... to
enjoy the sight of a dead young lady, nay,
two dead young ladies, for it proved twice
as fine as the first report."
Persuasion
155


* Re: Portable memory barrier?
  2017-05-06  8:47 ` Jeffrey R. Carter
@ 2017-05-06 14:17   ` Jere
  2017-05-06 19:08     ` Dennis Lee Bieber
                       ` (3 more replies)
  0 siblings, 4 replies; 65+ messages in thread
From: Jere @ 2017-05-06 14:17 UTC (permalink / raw)


On Saturday, May 6, 2017 at 4:47:59 AM UTC-4, Jeffrey R. Carter wrote:
> On 05/06/2017 04:23 AM, Jere wrote:
> >
> > In one procedure (Producer's Put procedure), I run into the situation where I
> > need to update the buffer, then the index (in that order) to ensure that the
> > consumer doesn't see the index change before the data is actually there.
> >
> > From a compiler optimization perspective, I believe I can tackle this with
> > the Volatile pragma/aspect.  Assuming it works like it does in C, volatile
> > variables have to have their accesses in sequence.  I think the RM also
> > reflects this [0]?
> 
> Volatile means that something other than the program may change the value. It 
> also means that all tasks of the program see the same sequence of updates. It 
> does not mean that accesses are non-overlapping. If task A does
> 
> I := I + 1;
> 
> and task B does
> 
> if I > 0 then
> 
> with I marked Volatile, then A may update half of I, B read the partly updated 
> value, and then A complete its update.
> 
> Atomic means Volatile plus all access are sequential; in the example, if A 
> begins to update I, B won't read it until A's update is complete.
> 
> For your purposes, marking the (type of the) indices as Atomic should be all you 
> need.

Thanks for the response!  I'm good with all that.  I'm more worried about 
situations like these:

Task A:
-- Do some checks to make sure buffer not full

-- Not full, so put data
My_Array(Put_Index) := Some_Value;
if Put_Index = Buffer_Size then
   Put_Index := 0;
else
   Put_Index := Put_Index + 1;
end if;

Task B:
if Put_Index = Get_Index then -- empty
   return False;
end if;

Result_Value := My_Array(Get_Index);
-- Do other operations that update Get_Index

Assuming the indexes are atomic and volatile, then I am not worried about 
partial writes to the indexes.  I'm more worried about some architectures that 
allow instruction reordering.  What I don't want is for Task A's store to 
My_Array to happen after its update to Put_Index.  If that 
happens, then there are cases where Task B verifies the buffer isn't empty and 
tries to access the array but Task A hasn't gotten around to updating the array 
because the CPU decided to rearrange the stores.

In C++, I would handle this with a fence, which is part of the standard.
My_Array[Put_Index] = Some_Value;

//force CPU to do the previous store before doing ones after the fence
std::atomic_thread_fence(std::memory_order_release);

if (Put_Index == Buffer_Size) {
   Put_Index = 0;
} else {
   Put_Index++;
}

With that I know that, on architectures that reorder stores and supply a means 
to prevent it, the appropriate mechanism will be used to ensure Put_Index is 
updated after My_Array.

I'm not so much worried about atomic vs non atomic updates as I am sequential 
updates of different variables.

I'm looking for a language-defined way in Ada (or in GNAT if not) that will
implement a fence-like mechanism on architectures that require it.
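For reference, the standalone fence shown earlier can also be folded into the store itself. A hedged sketch of the two equivalent forms in C++ (the globals here are invented stand-ins for the buffer and its index):

```cpp
#include <atomic>
#include <cstddef>

// Illustrative globals standing in for the buffer and its index.
constexpr std::size_t Buffer_Size = 8;
int my_array[Buffer_Size];
std::atomic<std::size_t> put_index{0};

// Form 1: a standalone release fence plus a relaxed index store.
void put_with_fence(int v) {
    std::size_t p = put_index.load(std::memory_order_relaxed);
    my_array[p] = v;
    std::atomic_thread_fence(std::memory_order_release); // data before index
    put_index.store((p + 1) % Buffer_Size, std::memory_order_relaxed);
}

// Form 2: the same ordering folded into a single release store.
void put_with_release_store(int v) {
    std::size_t p = put_index.load(std::memory_order_relaxed);
    my_array[p] = v;
    put_index.store((p + 1) % Buffer_Size, std::memory_order_release);
}
```

Both forms guarantee that a thread which observes the new index with an acquire load also observes the array element written before it.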

Side note:  If I read it right, you seemed to indicate that volatile has nothing 
to do with sequence.  I'm no Ada expert, so I wanted to ask your interpretation 
of section 16/3 of C.6.  I'm used to C and C++ where if you have two volatile 
variables being updated like this:

Vol_Var_1 = 0;
Vol_Var_2 = 23;

That the C and C++ standard guarantees the "compiler" will not reorder those two 
statements.  Note this says nothing about atomicness (which is only one reason 
among others why Volatile is not sufficient for threading).  It also doesn't 
prevent the processor from reordering them.  When I read that section in C.6 of 
the RM, the verbiage made me think Ada also prevented the compiler from 
reordering the statements (again no relation to atomicness or CPU reordering).

Is this not the case?


* Re: Portable memory barrier?
  2017-05-06 14:17   ` Jere
@ 2017-05-06 19:08     ` Dennis Lee Bieber
  2017-05-06 19:26     ` Jeffrey R. Carter
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 65+ messages in thread
From: Dennis Lee Bieber @ 2017-05-06 19:08 UTC (permalink / raw)


On Sat, 6 May 2017 07:17:10 -0700 (PDT), Jere <jhb.chat@gmail.com>
declaimed the following:

>Is this not the case?

	My understanding of "volatile" is that it means: do not cache/optimize
(say, by assigning it to a register within the scope of the procedure);
instead, every source (read) usage must read the physical address, on the
basis that some other process (or hardware) may have changed the contents
behind your back.


	While a processor may reorder operations for efficiency, no
optimization (run-time or compiler) should reorder an update if the
subsequent operation needs to read the updated value.

	ipt := ipt + 1;
	arr(ipt) := something;
	rndvar := othervar;

It could, however, reorder the last line to anywhere in the sequence.
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/



* Re: Portable memory barrier?
  2017-05-06 14:17   ` Jere
  2017-05-06 19:08     ` Dennis Lee Bieber
@ 2017-05-06 19:26     ` Jeffrey R. Carter
  2017-05-06 19:41     ` Jeffrey R. Carter
  2017-05-09 19:49     ` Randy Brukardt
  3 siblings, 0 replies; 65+ messages in thread
From: Jeffrey R. Carter @ 2017-05-06 19:26 UTC (permalink / raw)


On 05/06/2017 04:17 PM, Jere wrote:
>
> Assuming the indexes are atomic and volatile, then I am not worried about
> partial writes to the indexes.  I"m more worried about some architectures that
> allow for instruction reordering.  What I don't want to happen is in Task A that
> the store to My_Array happens after the update to the Put_Index.  If that
> happens, then there are cases where Task B verifies the buffer isn't empty and
> tries to access the array but Task A hasn't gotten around to updating the array
> because the CPU decided to rearrange the stores.

Yes, Ada, having had tasking since the beginning, has ways to deal with this. 
They involve grouping the index increment and the array assignment together into 
a single thing that is atomic with respect to the index read. This is called a 
"critical region" in CS, and is usually a protected operation in Ada, although 
one could also use a suspension object or machine-code insertion.
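A rough C++ analogue of such a critical region (a mutex standing in for an Ada protected object; the class and its names are illustrative only):

```cpp
#include <deque>
#include <mutex>
#include <optional>

// Illustrative mutex-guarded buffer: put and get each execute as one
// critical region, so the index update and the data write can never be
// observed out of order by the other task.
class GuardedBuffer {
    std::mutex lock;
    std::deque<int> items;

public:
    void put(int v) {
        std::lock_guard<std::mutex> guard(lock);
        items.push_back(v);
    }

    std::optional<int> get() {
        std::lock_guard<std::mutex> guard(lock);
        if (items.empty())
            return std::nullopt;
        int v = items.front();
        items.pop_front();
        return v;
    }
};
```

The cost, as the original poster notes, is that producer and consumer can now block each other, which the lock-free design was meant to avoid.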

-- 
Jeff Carter
"[M]any were collected near them, ... to
enjoy the sight of a dead young lady, nay,
two dead young ladies, for it proved twice
as fine as the first report."
Persuasion
155



* Re: Portable memory barrier?
  2017-05-06 14:17   ` Jere
  2017-05-06 19:08     ` Dennis Lee Bieber
  2017-05-06 19:26     ` Jeffrey R. Carter
@ 2017-05-06 19:41     ` Jeffrey R. Carter
  2017-05-06 20:42       ` Niklas Holsti
  2017-05-09 19:49     ` Randy Brukardt
  3 siblings, 1 reply; 65+ messages in thread
From: Jeffrey R. Carter @ 2017-05-06 19:41 UTC (permalink / raw)


On 05/06/2017 04:17 PM, Jere wrote:
>
> Side note:  If I read it right, you seemed to indicate that volatile has nothing
> to do with sequence.  I'm no Ada expert, so I wanted to ask your interpretation
> of section 16/3 of C.6.  I'm used to C and C++ where if you have two volatile
> variables being updated like this:
>
> Vol_Var_1 = 0;
> Vol_Var_2 = 23;
>
> That the C and C++ standard guarantees the "compiler" will not reorder those two
> statements.  Note this says nothing about atomicness (which is only one reason
> among others why Volatile is not sufficient for threading).  It also doesn't
> prevent the processor from reordering them.  When I read that section in C.6 of
> the RM, the verbage made me think Ada also prevented the compiler from
> reordering the statements (again no relation to atomicness or CPU reordering).
>
> Is this not the case?

Volatile is intended to guarantee that all reads and writes deal with the memory 
location directly. The idea is for things like memory-mapped H/W registers, 
where the H/W might change the value of the register or writing to the register 
causes the H/W to do something. This prevents optimizations that keep the value 
in a register for a while before writing it back to memory, for example, but not 
reordering the accesses to different variables IIUC. However, this is getting 
into the area where language lawyers are needed.

-- 
Jeff Carter
"[M]any were collected near them, ... to
enjoy the sight of a dead young lady, nay,
two dead young ladies, for it proved twice
as fine as the first report."
Persuasion
155



* Re: Portable memory barrier?
  2017-05-06 19:41     ` Jeffrey R. Carter
@ 2017-05-06 20:42       ` Niklas Holsti
  0 siblings, 0 replies; 65+ messages in thread
From: Niklas Holsti @ 2017-05-06 20:42 UTC (permalink / raw)


On 17-05-06 22:41 , Jeffrey R. Carter wrote:
> On 05/06/2017 04:17 PM, Jere wrote:
>>
>> Side note:  If I read it right, you seemed to indicate that volatile
>> has nothing
>> to do with sequence.  I'm no Ada expert, so I wanted to ask your
>> interpretation
>> of section 16/3 of C.6.  I'm used to C and C++ where if you have two
>> volatile
>> variables being updated like this:
>>
>> Vol_Var_1 = 0;
>> Vol_Var_2 = 23;
>>
>> That the C and C++ standard guarantees the "compiler" will not reorder
>> those two
>> statements.  Note this says nothing about atomicness (which is only
>> one reason
>> among others why Volatile is not sufficient for threading).  It also
>> doesn't
>> prevent the processor from reordering them.  When I read that section
>> in C.6 of
>> the RM, the verbage made me think Ada also prevented the compiler from
>> reordering the statements (again no relation to atomicness or CPU
>> reordering).
>>
>> Is this not the case?
>
> Volatile is intended to guarantee that all reads and writes deal with
> the memory location directly.

Well... for some value of "directly". As I understand it, reading or 
writing a volatile variable can still use the cache, so depending on the 
cache architecture and state, a read may or may not read the "real" 
memory, and ditto for a write.

> The idea is for things like memory-mapped
> H/W registers, where the H/W might change the value of the register or
> writing to the register causes the H/W to do something.

And such registers are usually defined in the HW as non-cacheable, so in 
the end volatile access to such registers works as intended, and as you say.

> This prevents
> optimizations that keep the value in a register for a while before
> writing it back to memory, for example, but not reordering the accesses
> to different variables IIUC.

Just as in C, volatile accesses in Ada are guaranteed to occur in the 
order and number written in the source code and executed in the 
"standard" order:

RM C.6 (16/3): All tasks of the program (on all processors) that read or 
update volatile variables see the same order of updates to the 
variables. A use of an atomic variable or other mechanism may be 
necessary to avoid erroneous execution and to ensure that access to 
nonatomic volatile variables is sequential (see 9.10).

RM C.6 (20): The external effect of a program (see 1.1.3) is defined to 
include each read and update of a volatile or atomic object. The 
implementation shall not generate any memory reads or updates of atomic 
or volatile objects other than those specified by the program.


-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .



* Re: Portable memory barrier?
  2017-05-06  2:23 Portable memory barrier? Jere
  2017-05-06  8:47 ` Jeffrey R. Carter
@ 2017-05-07 20:18 ` Robert Eachus
  2017-05-08  7:45   ` Dmitry A. Kazakov
  1 sibling, 1 reply; 65+ messages in thread
From: Robert Eachus @ 2017-05-07 20:18 UTC (permalink / raw)


On Friday, May 5, 2017 at 10:23:45 PM UTC-4, Jere wrote:
 
> Are there any Ada language constructs that can help?

No, but you don't need the help you think you need.  On modern, superscalar CPUs, you may have the assignment to the buffer and the index operation occurring as part of the same clock cycle.  Then the writes (to memory if memory locations are assigned) will occur in the order required by the ISA (for x86, what you require). For some RISC processors a fence will need to occur as part of that bundle of instructions, but trust the compiler for that.

If the read of the circular buffer may be occurring on a different CPU core which shares one or more cache levels, the instructions to make a location Volatile will be ignored by the CPU.  The compiler will mark the location, and the CPU will keep track of the type of memory, but Volatile does not cause writes to main memory.  What will happen instead is that the cache coherency subsystem will ensure that any accesses by CPU cores, GPUs, disk drives, or whatever will see a virtual location--which can move about--but all reads and writes will occur in sequence.

Didn't you ever wonder why modern CPUs have billions of transistors?  It is to keep things like this straight.  By the way, it can be the case that the first of two successive writes (with no intervening reads) will be eliminated, and/or additional reads may be added.  More to the point, if you cause an interrupt, the CPU will put that CPU core in some state consistent with memory--or at least, a state from which a cache flush will create such a state in memory.  Without a cache flush, the CPU only cares that what it sees through the cache--and write pipes--is consistent.



* Re: Portable memory barrier?
  2017-05-07 20:18 ` Robert Eachus
@ 2017-05-08  7:45   ` Dmitry A. Kazakov
  2017-05-08 15:56     ` Robert Eachus
  0 siblings, 1 reply; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-08  7:45 UTC (permalink / raw)


On 07/05/2017 22:18, Robert Eachus wrote:
> On Friday, May 5, 2017 at 10:23:45 PM UTC-4, Jere wrote:
>
>> Are there any Ada language constructs that can help?
>
> If the read of the circular buffer is occurring possibly on a
> different CPU core which shares one or more cache levels, the
> instructions to make a location Volatile will be ignored by the CPU. The
> compiler will mark the location, and the CPU will keep track of the type
> of memory, but Volatile does not cause writes to main memory.

That is ordering accesses to the same location, while the concern is 
about accesses to two independent locations and the possibility of these 
being reordered by the CPU. In particular the reader's part:

    read location A [write-pointer for buffer empty check]
    read location B [buffer item at read-pointer]

The writer's part

    write location B [buffer item at write-pointer]
    write location A [write-pointer increment]

Now if you reorder the reader's or the writer's sequence you will get a bug, 
one which will never happen for buffers larger than one item, but nonetheless.

I don't believe either of the sequences can get reordered by the CPU, 
but it is only belief and common sense.
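The two-location pattern above, sketched in C++ with an explicit release/acquire pairing (illustrative names; this is one way, not the only way, to forbid that reordering):

```cpp
#include <atomic>

// Illustrative stand-ins for the two locations:
// location B is the buffer item, location A is the write-pointer.
int              location_b = 0;      // plain (non-atomic) data
std::atomic<int> location_a{0};       // publication flag / write-pointer

void writer() {
    location_b = 123;                               // write B first
    location_a.store(1, std::memory_order_release); // then A: publish
}

bool reader(int& out) {
    if (location_a.load(std::memory_order_acquire) == 0)
        return false;        // A not yet written, don't touch B
    out = location_b;        // acquire saw the release, so B is visible
    return true;
}
```

Without the release/acquire pair (i.e. with relaxed or plain accesses), the CPU and compiler would both be free to reorder the two writes or the two reads.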

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: Portable memory barrier?
  2017-05-08  7:45   ` Dmitry A. Kazakov
@ 2017-05-08 15:56     ` Robert Eachus
  2017-05-08 16:22       ` Dmitry A. Kazakov
  0 siblings, 1 reply; 65+ messages in thread
From: Robert Eachus @ 2017-05-08 15:56 UTC (permalink / raw)


One sentence summary: The CPU will play lots of tricks behind your back, but in such a way you can pretend your code works as you intended.  (Er, after you get the bugs out.)

On Monday, May 8, 2017 at 3:45:26 AM UTC-4, Dmitry A. Kazakov wrote:

> I don't believe either of the sequences can get reordered by the CPU, 
> but it is only belief and common sense.

Silly rabbit, Trix are for kids.  Oops, silly programmer, common sense has nothing to do with modern CPUs.

If I didn't make it clear, modern CPUs analyze which pending instructions can be done NOW, and they select some subset of those operations. This is called OoO (out of order) scheduling.  In addition, the CPU has a register file with lots of extra registers. It uses register renaming to ensure that instructions "see" the right arguments, even if there are four or five different copies of the index in the register file.  This allows the CPU to put only one copy in the write pipe, and run the code faster than a naive counting of writes to memory suggests.  Finally, superscalar CPUs convert the instructions into micro-ops (name depends on CPU vendor) and then several of these micro-ops belonging to multiple instructions get executed all at once.

The CPU tries hard to eliminate reads and writes to main memory which are slow.  By putting the memory controller on the CPU, and having it look into the cache no matter the source of the request, even volatile memory can be virtualized.  (There are a few pages in memory which can't be virtualized, because they contain things like the real-time clock.  But that is a detail known to the CPU.)

All of this is done with state not associated with any real clock, but with the instructions as they execute.  If you think of a program counter as a pointer to the next instruction to be executed, and defining a state; modern CPUs have multiple states in play at the same time--and out of order.

In all this, the CPU arranges things so that all of the outputs occur in the expected order. If you halt the CPU, then flush the cache, THEN there will be one point in the programmer's view of the code that corresponds to the current state--in other words, debuggers work.  (Assuming you look at the code as generated, or turn off optimization in the compiler.)

As far as this issue is concerned, the CPU implements the ISA, so you get the effects you expect from a program.  All of this is on an "as if" basis, and in fact the CPU is seldom in a reproducible state.

Is any of this important to a high-level language programmer?  Not really, unless they are working in hard real time. (Which I did.)  If you program in assembler, or more likely work on a compiler back-end, knowing which instructions fit together is important.  Even more important is coding so that an instruction that references a new cache line in memory is followed by as many instructions as possible that don't use that value.  Sometimes you run into cases where you want to take advantage of write combining or some other trick involving the write pipe--but today, smart compiler optimization is about accesses to main memory and not much else.



* Re: Portable memory barrier?
  2017-05-08 15:56     ` Robert Eachus
@ 2017-05-08 16:22       ` Dmitry A. Kazakov
  2017-05-08 18:39         ` Robert Eachus
                           ` (4 more replies)
  0 siblings, 5 replies; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-08 16:22 UTC (permalink / raw)


On 2017-05-08 17:56, Robert Eachus wrote:
> One sentence summary: The CPU will play lots of tricks behind your
> back, but in such a way you can pretend your code works as you intended.
> (Er, after you get the bugs out.)
>
> On Monday, May 8, 2017 at 3:45:26 AM UTC-4, Dmitry A. Kazakov wrote:
>
>> I don't believe either of the sequences can get reordered by the CPU,
>> but it is only belief and common sense.
>
> Silly rabbit, Trix are for kids. Oops, silly programmer, common
> sense  has nothing to do with modern CPUs.
[...]
> Is any of this important to a high-level language programmer? Not
> really, unless they are working in hard real time. (Which I did.) If you
> program in assembler, or more likely work on a compiler back-end,
> knowing which instructions fit together is important. Even more
> important is coding so that an instruction that references a new cache
> line in memory is followed by as many instructions as possible that
> don't use that value. Some times you run into cases where you want to
> take advantage of write combining or some other trick involving the
> write pipe--but today, smart compiler optimization is about accesses to
> main memory and not much else.

I don't see anything there that would prevent the CPU from changing the 
order of accesses to two *independent* memory locations when only the 
program logic, and nothing else, ties them together. Well, except that 
there is no obvious reason why it would want to change that order.

Wikipedia provides a simpler example of the problem:

    https://en.wikipedia.org/wiki/Memory_barrier

The case of a lock-free FIFO is different because it involves index and 
array element.

For practical use when the buffer size is > 1 there cannot be any problem.

Yet the question stands. In particular, which Ada primitives may ensure 
safe implementations of lock-free structures when no protected objects are used.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



* Re: Portable memory barrier?
  2017-05-08 16:22       ` Dmitry A. Kazakov
@ 2017-05-08 18:39         ` Robert Eachus
  2017-05-08 19:18         ` Robert Eachus
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 65+ messages in thread
From: Robert Eachus @ 2017-05-08 18:39 UTC (permalink / raw)


On Monday, May 8, 2017 at 12:22:55 PM UTC-4, Dmitry A. Kazakov wrote:
> On 2017-05-08 17:56, Robert Eachus wrote:
> > One sentence summary: The CPU will play lots of tricks behind your
> > back, but in such a way you can pretend your code works as you intended.
> > (Er, after you get the bugs out.)
> >
> > On Monday, May 8, 2017 at 3:45:26 AM UTC-4, Dmitry A. Kazakov wrote:
> >
> >> I don't believe either of the sequences can get reordered by the CPU,
> >> but it is only belief and common sense.
> >
> > Silly rabbit, Trix are for kids. Oops, silly programmer, common
> > sense  has nothing to do with modern CPUs.
> [...]
> > Is any of this important to a high-level language programmer? Not
> > really, unless they are working in hard real time. (Which I did.) If you
> > program in assembler, or more likely work on a compiler back-end,
> > knowing which instructions fit together is important. Even more
> > important is coding so that an instruction that references a new cache
> > line in memory is followed by as many instructions as possible that
> > don't use that value. Some times you run into cases where you want to
> > take advantage of write combining or some other trick involving the
> > write pipe--but today, smart compiler optimization is about accesses to
> > main memory and not much else.
> 
> I don't see anything there what would prevent CPU from changing the 
> order of accesses to two *independent* memory locations when only the 
> program logic ties them nothing else. Well, except that there is no 
> obvious reason why it would like to change that order.
> 
> Wikipedia provides a more simple example of the problem:
> 
>     https://en.wikipedia.org/wiki/Memory_barrier
> 
> The case of a lock-free FIFO is different because it involves index and 
> array element.
> 
> For practical use when the buffer size is > 1 there cannot be any problem.
> 
> Yet the question stands. In particular, which Ada primitives may ensure 
> safe implementations of lock-free structures when no protected objects used.
> 
> -- 
> Regards,
> Dmitry A. Kazakov
> http://www.dmitry-kazakov.de



* Re: Portable memory barrier?
  2017-05-08 16:22       ` Dmitry A. Kazakov
  2017-05-08 18:39         ` Robert Eachus
@ 2017-05-08 19:18         ` Robert Eachus
  2017-05-08 21:09           ` Dmitry A. Kazakov
  2017-05-08 20:29         ` Niklas Holsti
                           ` (2 subsequent siblings)
  4 siblings, 1 reply; 65+ messages in thread
From: Robert Eachus @ 2017-05-08 19:18 UTC (permalink / raw)


On Monday, May 8, 2017 at 12:22:55 PM UTC-4, Dmitry A. Kazakov wrote:

> Yet the question stands. In particular, which Ada primitives may ensure 
> safe implementations of lock-free structures when no protected objects used.

There are two rules on x86 and AMD64 CPUs that govern:

Instructions must be retired in the correct order. (In practice, CPUs retire instructions in bunches.)

The execution state "seen" by an instruction reflects those instructions which were retired before it, and only those instructions.

In other words, the index update will occur, but the value to be used in the data field write will be kept in the original register.  This may result in two or more registers named AX, but that is a detail.  Only the appropriate version will be visible to any other instruction.

So the CPU can, and normally will, move the counter update before the data write.  The CPU wants to merge two writes to the same cache line when possible.  Obviously, it won't wait for another write of the counter or buffer, but if they are present in the reorder buffers, the CPU will do what it can to merge them.  (The first one goes into the write pipe, but the CPU noticed that it will need that value for the associated data write.  So it pulls a copy out of the write buffer, while leaving the original write in place.  If the next update becomes available while the index is still in the write pipe, the CPU will cancel the original write, or change the value to be written.  This depends on the particular hardware; the CPU may keep a copy in a renamed register, or it may rely on the copy in the write pipe, depending on the timing and on the size of the register.)

If, and ONLY if, there is an interrupt, and it is due to an instruction with an associated state (usually a cache miss), the CPU will stop in the correct state, no matter how much patching up that requires.  (Minimize page faults, but you knew that. ;-)  THEN the interrupt will occur.  A page fault here is a reference to a memory location which is in the pagefile, not in memory.  The CPU may get involved in a cache miss, but that process is usually all in hardware and tries to be invisible to the instruction stream.  In other words, the CPU will try to anticipate cache misses, but when it does so, it won't cause an interrupt, just throw the prefetch away.



* Re: Portable memory barrier?
  2017-05-08 16:22       ` Dmitry A. Kazakov
  2017-05-08 18:39         ` Robert Eachus
  2017-05-08 19:18         ` Robert Eachus
@ 2017-05-08 20:29         ` Niklas Holsti
  2017-05-08 21:09           ` Dmitry A. Kazakov
  2017-05-09 20:10           ` Randy Brukardt
  2017-05-09  0:05         ` Jere
  2017-05-09 19:53         ` Randy Brukardt
  4 siblings, 2 replies; 65+ messages in thread
From: Niklas Holsti @ 2017-05-08 20:29 UTC (permalink / raw)


On 17-05-08 19:22 , Dmitry A. Kazakov wrote:

[snip]

> I don't see anything there what would prevent CPU from changing the
> order of accesses to two *independent* memory locations when only the
> program logic ties them nothing else. Well, except that there is no
> obvious reason why it would like to change that order.

I quote from an earlier post of mine in this thread:

Just as in C, volatile accesses in Ada are guaranteed to occur in the 
order and number written in the source code and executed in the 
"standard" order [moreover, this includes all tasks in the program]:

RM C.6 (16/3): All tasks of the program (on all processors) that read or 
update volatile variables see the same order of updates to the 
variables. A use of an atomic variable or other mechanism may be 
necessary to avoid erroneous execution and to ensure that access to 
nonatomic volatile variables is sequential (see 9.10).

> Yet the question stands. In particular, which Ada primitives may ensure
> safe implementations of lock-free structures when no protected objects
> used.

Pragma Volatile + RM C.6 (16/3)? If these are not sufficient, why not?

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-08 20:29         ` Niklas Holsti
@ 2017-05-08 21:09           ` Dmitry A. Kazakov
  2017-05-09  4:34             ` Niklas Holsti
  2017-05-09 20:10           ` Randy Brukardt
  1 sibling, 1 reply; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-08 21:09 UTC (permalink / raw)


On 2017-05-08 22:29, Niklas Holsti wrote:
> On 17-05-08 19:22 , Dmitry A. Kazakov wrote:
>
> [snip]
>
>> I don't see anything there what would prevent CPU from changing the
>> order of accesses to two *independent* memory locations when only the
>> program logic ties them nothing else. Well, except that there is no
>> obvious reason why it would like to change that order.
>
> I quote from an earlier post of mine in this thread:
>
> Just as in C, volatile accesses in Ada are guaranteed to occur in the
> order and number written in the source code and executed in the
> "standard" order [moreover, this includes all tasks in the program]:
>
> RM C.6 (16/3): All tasks of the program (on all processors) that read or
> update volatile variables see the same order of updates to the
> variables.

This is ambiguous. Does it read as:

A. Order of updates per volatile variable

B. Order of updates to the set of all volatile variables

B is sufficient, A is not. Depending on the implementation A may imply B 
or not.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-08 19:18         ` Robert Eachus
@ 2017-05-08 21:09           ` Dmitry A. Kazakov
  2017-05-08 23:24             ` Robert Eachus
  0 siblings, 1 reply; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-08 21:09 UTC (permalink / raw)


On 2017-05-08 21:18, Robert Eachus wrote:

> So the CPU can, and normally will, move the counter update before
> the  data write. The CPU wants to merge two writes to the same cache line
> when possible. Obviously, it won't wait for another write of the counter
> or buffer, but if they are present in the reorder buffers, the CPU will
> do what it can to merge them. (The first one goes into the write pipe,
> but the CPU noticed that it will need that value for the associated data
> write. So it pulls a copy out of the write buffer, while leaving the
> original write in place. If the next update becomes available while the
> index is still in the write pipe, the CPU will cancel the original
> write, or change the value to be written. (Depends on the particular
> hardware. Also the CPUs may keep a copy in a renamed register, or it may
> rely on the copy in the write pipe. Depends on the timing, and on the
> size of the register.)

This strategy does not prevent reordering writes:

    FIFO (Write_Index + 1) := Element;
    Write_Index := Write_Index + 1;

into

    Write_Index := Write_Index + 1;
    FIFO (Write_Index) := Element;

which would be a bug.
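One way to rule this reordering out in practice is to publish the index only after the data write, using a release store. A minimal C11 sketch (C rather than Ada, since the thread keeps comparing with C semantics; all names and sizes here are illustrative, not from the thread):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <assert.h>

#define FIFO_SIZE 16

static int fifo[FIFO_SIZE];
static atomic_size_t write_index;   /* zero-initialized */

/* Producer side only: store the element first, then publish the new
   index.  The release store forbids both the compiler and the CPU from
   sinking the data write below the index update, which is exactly the
   reordering described above. */
void put(int element)
{
    size_t w = atomic_load_explicit(&write_index, memory_order_relaxed);
    fifo[(w + 1) % FIFO_SIZE] = element;
    atomic_store_explicit(&write_index, w + 1, memory_order_release);
}
```

A consumer would pair this with an acquire load of `write_index` before reading `fifo`.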

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-08 21:09           ` Dmitry A. Kazakov
@ 2017-05-08 23:24             ` Robert Eachus
  2017-05-09  0:30               ` Jere
  2017-05-09 19:57               ` Randy Brukardt
  0 siblings, 2 replies; 65+ messages in thread
From: Robert Eachus @ 2017-05-08 23:24 UTC (permalink / raw)


On Monday, May 8, 2017 at 5:09:45 PM UTC-4, Dmitry A. Kazakov wrote:

> This strategy does not prevent reordering writes:
> 
>     FIFO (Write_Index + 1) := Element;
>     Write_Index := Write_Index + 1;
> 
> into
> 
>     Write_Index := Write_Index + 1;
>     FIFO (Write_Index) := Element;
> 
> which would be a bug.

You are UNDERthinking it.  There is a situation which can occur, and which programmers should be wary of. (Skip to before the footnotes if you don't care about details.)

What happens is that the code is broken into:

r1 := Write_Index;
...
r2 := r1 x Element'Length;
r3 := r2 + FIFO'First;
r4 := Element;
(r3) := r4;
r1 := r1 + 1;
(Write_Index'Address) := r1;

(Parentheses write to the address.)

The compiler will play some games, and certainly will leave Write_Index in a register if possible.  But with each of these assignments, there is a context consisting of its set of registers.  So the two assignments to r1 go to different (renaming) registers. We don't care what the CPU calls them, as long as it keeps track of which r1 belongs to which context.

Now the CPU can reorder to its heart's content, as long as it doesn't try to use a value before it has been computed.*  Old values?  No problem, they are available in renamed registers.  The value of Write_Index (in r1) when the write to FIFO completes?  Who cares, it was used to compute the correct address earlier.

How does the CPU know when a value in a register is no longer needed? When all of the contexts containing that value of the register have been used.  Count every use of the value until the register is changed, and, of course, reduce the count on use.  The counts are not unbounded because you only count for current residents of the reordering buffers (ROBs) as micro-ops are parsed and loaded in.  What if you need more registers than you have in the renaming list?  Stall instruction parsing until one is freed up.  Almost never happens.  Even though the AMD64 ISA has 16 integer register names** that can appear in a program, CPUs have about a hundred renaming registers per core.  Recent architectures, Intel's Broadwell and AMD's Zen have 168 entry integer register files.  (This means 1344 total in an eight core chip.  Note that the renaming registers are shared between two threads per core. There are only 160 floating-point renaming registers in Zen--1280 total.  This is eight per core shy of Intel's Skylake and Kaby Lake.)


So what can happen that is a potential problem?  All those renaming registers mean that the most recent written value of Write_Index can trail writes to the FIFO queue.  Normally this will only happen when you inline code, and there are multiple instances of the write routine inlined into a single (callable) procedure or function.  Now you can have several values in the FIFO which will not yet be readable.  As long as you are not worried about (nanosecond) latency, not an issue.  Worst case, the same thing happens on the read side, and the writing side needs to wait.  As a former real-time programmer, I can tell you that you don't want the writes getting thrown on the floor.

You can have the same problem if you set the write side at a higher priority than the read side on a single-core, non-multithreaded CPU.  In either case, the solution is the same:  Make the FIFO large enough that such collisions (between the two indexes) don't happen.  This is the sort of issue that the proof-of-correctness folk don't see.  You can (and should) test for this in any hard real-time system, no matter how much theorem proving you do.  Assuming you do it right, now you have average and maximum queuing latencies to feed back into your theorem prover.  I'm not against theorem proving, I just know from experience that analysis and testing are also needed. ;-)

* Technically, use a value before the pipe computing it has it ready for use.  Due to the nature of pipelined architectures, it matters not if that value is not ready until stage 10, if the instruction that needs it won't need it until (its) stage 9.  The CPU sees only that those two instructions cannot be dispatched on the same clock cycle, but otherwise the "store forwarding" makes it look like it only took one clock cycle to compute.  If you--and the CPU--are clever, you can even have the two micro-ops in the same pipe, one after the other.  If you are even more clever, you can choose micro-ops such that they can execute on the same clock cycle.

It is not that difficult to write the innermost loops of some programs to take only one clock cycle on average per iteration.  You have three or more instructions executing together, as parts of the loop for iterations n, n-1, and n-2.  So any iteration of the loop takes two clock cycles plus the pipeline depth, but the average iteration takes very close to one clock cycle.

** Well 64-bit integer registers or parts of one, plus some other registers that it makes sense to treat as just another integer register.  Floating point values have different register names, and their own large renaming register set.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-08 16:22       ` Dmitry A. Kazakov
                           ` (2 preceding siblings ...)
  2017-05-08 20:29         ` Niklas Holsti
@ 2017-05-09  0:05         ` Jere
  2017-05-09  8:26           ` Dmitry A. Kazakov
  2017-05-09 19:53         ` Randy Brukardt
  4 siblings, 1 reply; 65+ messages in thread
From: Jere @ 2017-05-09  0:05 UTC (permalink / raw)


On Monday, May 8, 2017 at 12:22:55 PM UTC-4, Dmitry A. Kazakov wrote:  
> For practical use when the buffer size is > 1 there cannot be any problem.

I saw you mention this earlier.  Why does the buffer size prevent that problem?


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-08 23:24             ` Robert Eachus
@ 2017-05-09  0:30               ` Jere
  2017-05-09  4:02                 ` Robert Eachus
                                   ` (2 more replies)
  2017-05-09 19:57               ` Randy Brukardt
  1 sibling, 3 replies; 65+ messages in thread
From: Jere @ 2017-05-09  0:30 UTC (permalink / raw)


On Monday, May 8, 2017 at 7:24:42 PM UTC-4, Robert Eachus wrote:
> On Monday, May 8, 2017 at 5:09:45 PM UTC-4, Dmitry A. Kazakov wrote:
> 
> > This strategy does not prevent reordering writes:
> > 
> >     FIFO (Write_Index + 1) := Element;
> >     Write_Index := Write_Index + 1;
> > 
> > into
> > 
> >     Write_Index := Write_Index + 1;
> >     FIFO (Write_Index) := Element;
> > 
> > which would be a bug.
> 
> You are UNDERthinking it.  There is a situation which can occur, and programmers should be wary of. (Skip to before the footnotes if you don't care about details.)
> 
> What happens is that the code is broken into:
> 
> r1 := Write_Index;
> ...
> r2 := r1 x Element'Length;
> r3 := r2 + FIFO'First;
> r4 := Element;
> (r3) := r4;
> r1 := r1 + 1;
> (Write_Index'Address) := r1;
> 
> (Parentheses write to the address.)
> 
> The compiler will play some games, and certainly will leave Write_Index in a register if possible.  But with each of these assignments, there is a context consisting of its set of registers.  So the two assignments to r1 go to different (renaming) registers. We dpn't care what the CPU calls them, as long as it keeps track of which r1 belongs to which contexts.
> 
> Now the CPU can reorder to its heart's content, as long as it doesn't try to use a value before it has been computed.*  Old values?  No problem, they are available in renamed registers.  The value of Write_Index (in r1) when the write to FIFO completes?  Who cares, it was used to compute the correct address earlier.
> 
> How does the CPU know when a value in a register is no longer needed? When all of the contexts containing that value of the register have been used.  Count every use of the value until the register is changed, and, of course, reduce the count on use.  The counts are not unbounded because you only count for current residents of the reordering buffers (ROBs) as micro-ops are parsed and loaded in.  What if you need more registers than you have in the renaming list?  Stall instruction parsing until one is freed up.  Almost never happens.  Even though the AMD64 ISA has 16 integer register names** that can appear in a program, CPUs have about a hundred renaming registers per core.  Recent architectures, Intel's Broadwell and AMD's Zen have 168 entry integer register files.  (This means 1344 total in an eight core chip.  Note that the renaming registers are shared between two threads per core. There are only 160 floating-point renaming registers in Zen--1280 total.  This is eight per core shy of Intel's Skylake and Kaby Lake.)
> 
> 
> So what can happen that is a potential problem?  All those renaming registers, mean that the most recent written value of Write_Index can trail writes to the FIFO queue.  Normally this will only happen when you inline code, and there are multiple instances of the write routine inlined into a single (callable) procedure or function.  Now you can have several values in the FIFO which will not yet be readable.  As long as you are not worried about (nanosecond) latency, not an issue.  Worst case, the same thing happens on the read side, and the writing side needs to wait.  Being a former real-time programmer, you don't want the writes getting thrown on the floor.
> 
> You can have the same problem if you set the write side at a higher priority than the read side on a single core no multithreading CPU.  In either case, the solution is the same:  Make the FIFO large enough that such collisions (between the two indexes) don't happen.  This is the sort of issue that the proof of correctness folk don't see.  You can (and should) test for this in any hard real-time system, no matter how much theorem proving you do.  Assuming you do it right, now you have average and maximum queuing latencies to feed back into your theorem prover.  I'm not against theorem proving, I just know from experience that analysis and testing is also needed. ;-)
> 
> * Technically use a value before the pipe computing it has it ready for use.  Due to the nature of pipelined architectures, it matters not if that value is not ready until stage 10, if the instruction that needs it won't need it until (its) stage 9.  The CPU sees that as those two instructions cannot be dispatched on the same clock cycle, but otherwise the "store forwarding" makes it look like it only took one clock cycle to compute.  If you--and the CPU--are clever, you can even have the two micro-ops in the same pipe, one after the other.  If you are even more clever, you can choose micro-ops such that they can execute on the same clock cycle.
> 
> It is not that difficult to write the innermost loops of some programs to take only one clock cycle on average per iteration.  You have three or more instructions executing together, as parts of the loop for iterations n, n-1, and n-2.  So any iteration of the loop takes two clock cycles plus the pipeline depth, but the average iteration take very close to one clock cycle.
> 
> ** Well 64-bit integer registers or parts of one, plus some other registers that it makes sense to treat as just another integer register.  Floating point values have different register names, and their own large renaming register set.

I am still digesting some of this, so forgive the simple question:

What prevents the last 3 lines from being reordered like this:
r1 := r1 + 1;
(Write_Index'Address) := r1;
(r3) := r4;

I may have misunderstood your post, but isn't it free to make this adjustment 
to the instructions?  There is no dependence on r3 or r4 for the 
Write_Index'Address part or the r1 increment.  I ask because this very issue 
has (allegedly) been seen implementing lock free designs in other languages 
like C/C++.

If it is, then the risk is that after
(Write_Index'Address) := r1;

a task context switch or interrupt occurs and now the index is incremented
before the data was added (and now pointing to junk).

By the way, thank you for the very detailed explanation.  I can't say
that I fully understand it all yet (as probably indicated from my question).  
I'm probably gonna have to reread it a few times.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09  0:30               ` Jere
@ 2017-05-09  4:02                 ` Robert Eachus
  2017-05-09  4:32                 ` Robert Eachus
  2017-05-09 20:01                 ` Randy Brukardt
  2 siblings, 0 replies; 65+ messages in thread
From: Robert Eachus @ 2017-05-09  4:02 UTC (permalink / raw)


On Monday, May 8, 2017 at 8:30:02 PM UTC-4, Jere wrote:

First, oops!  I missed that the program computes Write_Index + 1 twice. The CPU won't, but the code sequence should be:

r1 := Write_Index; 
... 
r1 := r1 + 1;
r2 := r1 x Element'Length; 
r3 := r2 + FIFO'First; 
r4 := Element; 
(r3) := r4; 
(Write_Index'Address) := r1; 

(Parentheses write to the address.)
 
> What prevents the last 3 lines from being reordered like this:
> r1 := r1 + 1;
> (Write_Index'Address) := r1;
> (r3) := r4;

The rule against reordering of writes.  I think your question still stands, but that is the answer: moves to registers are reads, moves to memory are writes.  Reads can be moved before writes to other locations, but the order of writes has to be maintained.

> I may have misunderstood your post, but isn't it free to make this adjustment 
> to the instructions?  There is no dependence on r3 or r4 for the 
> Write_Index'Address part or the r1 increment.  I ask because this very issue 
> has (allegedly) been seen implementing lock free designs in other languages 
> like C/C++.
> 
> If it is, then the risk is that after
> (Write_Index'Address) := r1;
> 
> a task context switch or interrupt occurs and now the index is incremented
> before the data was added (and now pointing to junk).

No. Or at least it should not happen.  A single register set exists when the interrupt is allowed to occur.  It may have the (correct) value of Write_Index 
in memory (or not) or in r1 (or not).  If r1 has the old or new value, not a problem.  If the write to Write_Index occurred, the write to FIFO must have happened first (on an x86 or AMD64 CPU at least).  In other words, the write pipe will get emptied before the cache gets flushed.  Note that the cache flush does not occur until after the interrupt.*  On other (usually RISC) architectures, the interrupt service routine may have to execute a "fence" instruction to force the CPU to a consistent state. (x86 does have three fence instructions; I found an interesting writeup that shows that they are very occasionally needed: https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/) Fences are more often needed on RISC CPUs, since the ordering rules are more relaxed.  That may be where the problem you noted occurred.  Especially in embedded controllers, there can be interrupt service routines not supplied by the OS.  Especially if the other half of this stream is implemented in the interrupt, the fence instruction is necessary.

In general do not use fences in Ada code, use protected operations.  At a minimum the lock needed for a cleverly written protected object can be replaced by a read-modify-write (RMW) instruction, and the CPU should execute it in under a nanosecond.  Code cannot be moved past RMW operations, and they are much faster than fences.
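A minimal C11 sketch of the RMW point above (the names are mine, not from the thread): `atomic_fetch_add` compiles to a single locked xadd on x86, and as a seq_cst RMW it also orders surrounding code, so it can stand in for a lock around a simple counter.

```c
#include <stdatomic.h>
#include <assert.h>

/* A shared counter "protected" by a single read-modify-write: on x86
   this is one locked xadd, far cheaper than a lock or a fence, and
   code cannot be moved past a seq_cst RMW. */
static atomic_int hits;

int bump(void)
{
    /* atomic_fetch_add returns the value *before* the increment */
    return atomic_fetch_add(&hits, 1);
}
```

This is a sketch of the idea, not a claim about what any Ada compiler emits for a given protected object.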

Why isn't a fence needed here?  When it comes to shared variables, one thread only writes.  But the example is flawed, in that we do have an access that needs to be protected.  To make the buffer a circular buffer you need to add a test and adjustment.  For simplicity, change:

 FIFO (Write_Index + 1) := Element; 
 Write_Index := Write_Index + 1;

to:

Temp := Write_Index;
if Temp = Max 
then Temp := 1;
else Temp := Temp + 1;
end if;
FIFO (Temp) := Element; 
Write_Index := Temp;

Making Temp explicit is a belt-and-suspenders operation.  The CPU will generate the same set of micro-ops, or at least the same set it would once we make the buffer circular.  But now it should be clear that the two writes cannot be reordered.  Note that the writes to Temp can be carried out in a register (r1), and register moves count as reads under the x86 rules.




...and the same for Read_Index. 
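For comparison, the Temp version above can be transcribed into C11 (a sketch; Max, the 1-based layout, and all names are assumptions), with the publish made an explicit release store so the ordering does not depend on x86-specific write rules:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <assert.h>

#define MAX 8

static int fifo[MAX + 1];               /* slot 0 unused: 1-based, like the Ada */
static atomic_size_t write_index = 1;

void put(int element)
{
    size_t temp = atomic_load_explicit(&write_index, memory_order_relaxed);
    if (temp == MAX)
        temp = 1;                       /* wrap around */
    else
        temp = temp + 1;
    fifo[temp] = element;               /* data write first...           */
    atomic_store_explicit(&write_index, temp,
                          memory_order_release);  /* ...then the publish */
}
```

On x86 the release store costs nothing extra; on weaker architectures it is where the needed barrier comes from.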

Simpler?  Huh?  What are we doing here?  The same thing we did in the pseudocode, but with the wrap-around test made explicit.
> I'm probably gonna have to reread it a few times.

I could probably teach a graduate course on this, and not include anything not touched on here.  And, yes, I did teach a graduate course in compiler construction several times, back before this revolution in how CPUs are put together.  If I was teaching that course now, I would either limit it to the front end and a bit of optimization, or stretch it into two semesters.  I won't say that this stuff makes quantum mechanics look easy, but they are at the same level of complexity.  (In fact you may have seen that Elon Musk, among others, believes that there is a good chance we are living in a simulation. The similarities are eerie, and the "as ifs" pop up in the same places.)

*  Especially if you are running on an AMD CPU in a virtual processor.  As long as the CPU stays in the hypervisor state, the cache doesn't need to be flushed.  That's a detail and a huge bit of extra complexity here.  It allows the hypervisor to service a page fault without flushing the cache.  Of course, if there is a context switch, even to the OS on the virtual machine, the cache flush is needed. Usually if the page is not present in memory the cache flush (and a context switch) happens, then the hypervisor turns control over to the "real" OS.  (Disks, even SSDs, are slow.) Otherwise the hypervisor patches up the page tables and lets the (faulting) instruction be retried, never touching the OS.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09  0:30               ` Jere
  2017-05-09  4:02                 ` Robert Eachus
@ 2017-05-09  4:32                 ` Robert Eachus
  2017-05-09  4:44                   ` Robert Eachus
  2017-05-09 22:26                   ` Jere
  2017-05-09 20:01                 ` Randy Brukardt
  2 siblings, 2 replies; 65+ messages in thread
From: Robert Eachus @ 2017-05-09  4:32 UTC (permalink / raw)


Argh!  I just accidentally deleted a long post due apparently to a bug in the new version of Windows 10.  Anyway:

On Monday, May 8, 2017 at 8:30:02 PM UTC-4, Jere wrote:
> On Monday, May 8, 2017 at 7:24:42 PM UTC-4, Robert Eachus wrote:
> > On Monday, May 8, 2017 at 5:09:45 PM UTC-4, Dmitry A. Kazakov wrote:
> > 
> > > This strategy does not prevent reordering writes:
> > > 
> > >     FIFO (Write_Index + 1) := Element;
> > >     Write_Index := Write_Index + 1;
> > > 
> > > into
> > > 
> > >     Write_Index := Write_Index + 1;
> > >     FIFO (Write_Index) := Element;
> > > 
> > > which would be a bug.

That requires reordering writes, which is not allowed on x86 and AMD64.  On RISC machines there are fence instructions that may be needed.  There are three fence instructions on x86.  They are rarely used, and should be non-existent in user code.  Here is an instance where you do need a fence on x86, or just to change a couple of instructions to RMW (Atomic): https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
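The reordering the linked article is about is the one x86 does allow: a later read may complete before an earlier write to a different location (the store buffer). A hedged C11 sketch of the article's two fixes, a full fence or a seq_cst RMW (function and variable names are illustrative):

```c
#include <stdatomic.h>
#include <assert.h>

static atomic_int flag_a, flag_b;       /* both start at 0 */

/* Dekker/Peterson-style handshake: the store to flag_a must be globally
   visible before the load of flag_b.  With relaxed operations alone the
   store->load pair may be reordered; a full fence closes the window
   (mfence on x86). */
int enter_with_fence(void)
{
    atomic_store_explicit(&flag_a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&flag_b, memory_order_relaxed);
}

/* Alternative: a seq_cst RMW (a locked instruction on x86) cannot be
   reordered with the following load. */
int enter_with_rmw(void)
{
    atomic_exchange(&flag_a, 1);
    return atomic_load_explicit(&flag_b, memory_order_relaxed);
}
```

Neither variant is needed for the single-producer, single-consumer queue discussed here, where each index has only one writer.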

> > What happens is that the code is broken into:
> > 
> > r1 := Write_Index;
> > ...
> > r2 := r1 x Element'Length;
> > r3 := r2 + FIFO'First;
> > r4 := Element;
> > (r3) := r4;
> > r1 := r1 + 1;
> > (Write_Index'Address) := r1;
> > 
> > (Parentheses write to the address.)

Oops! I misread the example.  The right code is:
 
 r1 := Write_Index;
 ...
 r1 := r1 + 1; -- added
 r2 := r1 x Element'Length;
 r3 := r2 + FIFO'First;
 r4 := Element;
 (r3) := r4;
 -- deleted from here.
 (Write_Index'Address) := r1;

The important point is that the two writes to memory cannot be reordered.  They will be put into the write pipe in that order.

> I am still digesting some of this, so forgive the simple question:
> 
> What prevents the last 3 lines from being reordered like this:
> r1 := r1 + 1;
> (Write_Index'Address) := r1;
> (r3) := r4;

See above.  Writes cannot be moved past writes.  In practice the write pipe does some magic combining writes to the same cache line, etc.  But it maintains the order on writes to cache.  If you have the right equipment you can peek and see that the writes to main memory are misordered, or even missing.  But without a signal analyzer or some such, you can't peek at main memory bypassing the cache. 

> By the way, thank you for the very detailed explanation.  I can't say
> that I fully understand it all yet (as probably indicated from my question).  
> I'm probably gonna have to reread it a few times.

The material here fleshed out is probably a full semester course.  I taught compiler construction as a graduate course in the past, and I would now insist on two semesters, or break it into front-end and code generation courses.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-08 21:09           ` Dmitry A. Kazakov
@ 2017-05-09  4:34             ` Niklas Holsti
  2017-05-09  6:16               ` Niklas Holsti
  0 siblings, 1 reply; 65+ messages in thread
From: Niklas Holsti @ 2017-05-09  4:34 UTC (permalink / raw)


On 17-05-09 00:09 , Dmitry A. Kazakov wrote:
> On 2017-05-08 22:29, Niklas Holsti wrote:
>> On 17-05-08 19:22 , Dmitry A. Kazakov wrote:
>>
>> [snip]
>>
>>> I don't see anything there what would prevent CPU from changing the
>>> order of accesses to two *independent* memory locations when only the
>>> program logic ties them nothing else. Well, except that there is no
>>> obvious reason why it would like to change that order.
>>
>> I quote from an earlier post of mine in this thread:
>>
>> Just as in C, volatile accesses in Ada are guaranteed to occur in the
>> order and number written in the source code and executed in the
>> "standard" order [moreover, this includes all tasks in the program]:
>>
>> RM C.6 (16/3): All tasks of the program (on all processors) that read or
>> update volatile variables see the same order of updates to the
>> variables.
>
> This is ambiguous. Does it read as:
>
> A. Order of updates per volatile variable
>
> B. Order of updates to the set of all volatile variables
>
> B is sufficient, A is not. Depending on the implementation A may imply B
> or not.

To me it is clear that B is meant, because the text uses the plural 
"variables".

If A were meant, the text would say "... that read or update a volatile 
variable see the same order of updates to the variable".

Also, remember that accesses to volatile variables are defined to be 
part of the "external effect" of the program, and all of the external 
effect must occur in the canonical order defined by the Ada semantics, 
without any reordering by compiler or processor.

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09  4:32                 ` Robert Eachus
@ 2017-05-09  4:44                   ` Robert Eachus
  2017-05-09 22:26                   ` Jere
  1 sibling, 0 replies; 65+ messages in thread
From: Robert Eachus @ 2017-05-09  4:44 UTC (permalink / raw)


On Tuesday, May 9, 2017 at 12:32:38 AM UTC-4, Robert Eachus wrote:
Did it again; well, I posted before I was finished editing.  But this time I'm just going to add a few sentences:

When reading x86 rules, understand moves to registers as reads and moves to memory as writes, even though they use the same op code.  Oh, and notice that the strict ordering on writes is on moves to cache, not on when they get copied back to main memory. Same goes for reads.  Reads and writes of different locations can be reordered; reads cannot be moved past reads, and writes cannot be moved past writes.

The rule on reordering of reads has a big gotcha.  The CPU can speculatively read from a memory location whenever it wants.  But it must not DO anything with that data until a real read comes along.  This allows the cache to pull in a complete cache line, among other things.  And the most important part of "not doing anything" is taking no interrupts.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09  4:34             ` Niklas Holsti
@ 2017-05-09  6:16               ` Niklas Holsti
  2017-05-09  8:34                 ` Dmitry A. Kazakov
  2017-05-09 20:18                 ` Randy Brukardt
  0 siblings, 2 replies; 65+ messages in thread
From: Niklas Holsti @ 2017-05-09  6:16 UTC (permalink / raw)


On 17-05-09 07:34 , Niklas Holsti wrote:
> On 17-05-09 00:09 , Dmitry A. Kazakov wrote:
>> On 2017-05-08 22:29, Niklas Holsti wrote:
>>> On 17-05-08 19:22 , Dmitry A. Kazakov wrote:
>>>
>>> [snip]
>>>
>>>> I don't see anything there what would prevent CPU from changing the
>>>> order of accesses to two *independent* memory locations when only the
>>>> program logic ties them nothing else. Well, except that there is no
>>>> obvious reason why it would like to change that order.
>>>
>>> I quote from an earlier post of mine in this thread:
>>>
>>> Just as in C, volatile accesses in Ada are guaranteed to occur in the
>>> order and number written in the source code and executed in the
>>> "standard" order [moreover, this includes all tasks in the program]:
>>>
>>> RM C.6 (16/3): All tasks of the program (on all processors) that read or
>>> update volatile variables see the same order of updates to the
>>> variables.
>>
>> This is ambiguous. Does it read as:
>>
>> A. Order of updates per volatile variable
>>
>> B. Order of updates to the set of all volatile variables
>>
>> B is sufficient, A is not. Depending on the implementation A may imply B
>> or not.
>
> To me it is clear that B is meant, because the text uses the plural
> "variables".
>
> If A were meant, the text would say "... that read or update a volatile
> variable see the same order of updates to the variable".
>
> Also, remember that accesses to volatile variables are defined to be
> part of the "external effect" of the program, and all of the external
> effect must occur in the canonical order defined by the Ada semantics,
> without any reordering by compiler or processor.

Hmm... there may be some support for interpretation A in the Annotated RM.

In the Annotated RM, the following notes follow C.6 (16/3) and IMO 
support the view that this rule is meant to ensure a consistent access 
order over all volatile variables, that is, interpretation B (note also 
that some sentences speak of variables in the plural, others of a 
single variable, suggesting that the choice of plural or singular is 
intentional and carries meaning):

- 16.a/3 Implementation Note: {AI05-0117-1} {AI05-0275-1} To ensure 
this, on a multiprocessor, any read or update of an atomic object may 
require the use of an appropriate memory barrier.

- 16.b/3 Discussion: {AI05-0275-1} From 9.10 it follows that (in 
non-erroneous programs) accesses to variables, including those shared by 
multiple tasks, are always sequential. This guarantees that no task will 
ever see partial updates of any variable. For volatile variables 
(including atomic variables), the above rule additionally specifies that 
all tasks see the same order of updates.

- 16.c/3 {AI05-0275-1} If for a shared variable X, a read of X occurs 
sequentially after an update of X, then the read will return the updated 
value if X is volatile or atomic, but may or may not return the 
updated value if X is nonvolatile. For nonvolatile accesses, a signaling 
action is needed in order to share the updated value.

("Sequentially" above refers to RM 9.10, I believe.)

On the other hand, the final note on (16/3) in the ARM, below, speaks of 
the order imposed by accesses to a _single_ atomic variable ("the same 
atomic variable"), which may suggest interpretation A:

- 16.d/3 {AI05-0275-1} Because accesses to the same atomic variable by 
different tasks establish a sequential order between the actions of 
those tasks, implementations may be required to emit memory barriers 
around such updates or use atomic instructions that imply such barriers.

It would be nice to hear from some ARG member re the correct 
interpretation of RM C.6 (16/3), as this seems to define whether pragma 
Volatile is enough to let us implement lock-free synchronisation routines.

-- 
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09  0:05         ` Jere
@ 2017-05-09  8:26           ` Dmitry A. Kazakov
  0 siblings, 0 replies; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-09  8:26 UTC (permalink / raw)


On 09/05/2017 02:05, Jere wrote:
> On Monday, May 8, 2017 at 12:22:55 PM UTC-4, Dmitry A. Kazakov wrote:
>> For practical use when the buffer size is > 1 there cannot be any problem.
>
> I saw you mention this earlier.  Why does the buffer size prevent that problem?

I have not analyzed all variants, so maybe I am wrong. The idea is that the 
modified index is used only for tests. When reading occurs at another 
index, a premature increment of the former does not matter.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09  6:16               ` Niklas Holsti
@ 2017-05-09  8:34                 ` Dmitry A. Kazakov
  2017-05-09 20:18                 ` Randy Brukardt
  1 sibling, 0 replies; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-09  8:34 UTC (permalink / raw)


On 09/05/2017 08:16, Niklas Holsti wrote:
> On 17-05-09 07:34 , Niklas Holsti wrote:
>> On 17-05-09 00:09 , Dmitry A. Kazakov wrote:
>>> On 2017-05-08 22:29, Niklas Holsti wrote:
>>>> On 17-05-08 19:22 , Dmitry A. Kazakov wrote:
>>>>
>>>> [snip]
>>>>
>>>>> I don't see anything there that would prevent the CPU from changing
>>>>> the order of accesses to two *independent* memory locations when only
>>>>> the program logic ties them together and nothing else. Well, except
>>>>> that there is no obvious reason why it would change that order.
>>>>
>>>> I quote from an earlier post of mine in this thread:
>>>>
>>>> Just as in C, volatile accesses in Ada are guaranteed to occur in the
>>>> order and number written in the source code and executed in the
>>>> "standard" order [moreover, this includes all tasks in the program]:
>>>>
>>>> RM C.6 (16/3): All tasks of the program (on all processors) that
>>>> read or
>>>> update volatile variables see the same order of updates to the
>>>> variables.
>>>
>>> This is ambiguous. Does it read as:
>>>
>>> A. Order of updates per volatile variable
>>>
>>> B. Order of updates to the set of all volatile variables
>>>
>>> B is sufficient, A is not. Depending on the implementation A may imply B
>>> or not.
>>
>> To me it is clear that B is meant, because the text uses the plural
>> "variables".
>>
>> If A were meant, the text would say "... that read or update a volatile
>> variable see the same order of updates to the variable".
>>
>> Also, remember that accesses to volatile variables are defined to be
>> part of the "external effect" of the program, and all of the external
>> effect must occur in the canonical order defined by the Ada semantics,
>> without any reordering by compiler or processor.
>
> Hmm... there may be some support for interpretation A in the Annotated RM.
>
> In the Annotated RM, the following notes follow C.6 (16/3) and IMO
> support the view that this rule is meant to ensure a consistent access
> order over all volatile variables, that is, interpretation B (note also
> that some sentences speak of variables in the plural, others speak of a
> single variable, suggesting that the choice of plural or singular is
> intentional and carries meaning):
>
> - 16.a/3 Implementation Note: {AI05-0117-1} {AI05-0275-1} To ensure
> this, on a multiprocessor, any read or update of an atomic object may
> require the use of an appropriate memory barrier.
>
> - 16.b/3 Discussion: {AI05-0275-1} From 9.10 it follows that (in
> non-erroneous programs) accesses to variables, including those shared by
> multiple tasks, are always sequential. This guarantees that no task will
> ever see partial updates of any variable. For volatile variables
> (including atomic variables), the above rule additionally specifies that
> all tasks see the same order of updates.
>
> - 16.c/3 {AI05-0275-1} If for a shared variable X, a read of X occurs
> sequentially after an update of X, then the read will return the updated
> value if X is volatile or atomic, but may or may not return the
> updated value if X is nonvolatile. For nonvolatile accesses, a signaling
> action is needed in order to share the updated value.
>
> ("Sequentially" above refers to RM 9.10, I believe.)
>
> On the other hand, the final note on (16/3) in the ARM, below, speaks of
> the order imposed by accesses to a _single_ atomic variable ("the same
> atomic variable"), which may suggest interpretation A:
>
> - 16.d/3 {AI05-0275-1} Because accesses to the same atomic variable by
> different tasks establish a sequential order between the actions of
> those tasks, implementations may be required to emit memory barriers
> around such updates or use atomic instructions that imply such barriers.
>
> It would be nice to hear from some ARG member re the correct
> interpretation of RM C.6 (16/3), as this seems to define whether pragma
> Volatile is enough to let us implement lock-free synchronisation routines.

Interpretation B would mean that having pragma Atomic on the read 
and write indices should be sufficient for the FIFO implementation. And 
it would be portable.

P.S. It would be a great help to have a hard requirement in the ARM to 
support Atomic for all scalar types, e.g. Stream_Element_Offset.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-06 14:17   ` Jere
                       ` (2 preceding siblings ...)
  2017-05-06 19:41     ` Jeffrey R. Carter
@ 2017-05-09 19:49     ` Randy Brukardt
  2017-05-09 22:07       ` Jere
  2017-05-10 18:28       ` Shark8
  3 siblings, 2 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-09 19:49 UTC (permalink / raw)


"Jere" <jhb.chat@gmail.com> wrote in message 
news:36434cd8-bc44-4d4a-957d-987fdf106be7@googlegroups.com...
...
> Thanks for the response!  I'm good with all that.  I'm more worried about
> situations like these:
>
> Task A:
> -- Do some checks to make sure buffer not full
>
> -- Not full, so put data
> My_Array(Put_Index) := Some_Value;
> if Put_Index = Buffer_Size then
>   Put_Index := 0;
> else
>   Put_Index := Put_Index + 1;
> end if;
>
> Task B:
> if Put_Index = Get_Index then -- empty
>   return False;
> end if;
>
> Result_Value := My_Array(Get_Index);
> -- Do other operations that update Get_Index
>
> Assuming the indexes are atomic and volatile, then I am not worried about
> partial writes to the indexes.  I'm more worried about some architectures 
> that
> allow for instruction reordering.

Atomic implies sequential access for atomic AND volatile objects (note that 
volatile does NOT imply that). See 9.10 if you dare. (And if you understand 
9.10, please join the ARG, we need you. ;-)

In particular, Atomic implies that any needed memory fences are used. 
(Again, volatile alone does not make such a requirement.) The idea is that 
one uses atomic objects to protect larger volatile data structures.

Caveat: that's an untestable requirement, so you're totally at the mercy of 
your compiler vendor to get this right.
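
[Editorial sketch of the mixed usage Randy describes; this is not code from 
the thread, and the package name, element type, and buffer size are invented. 
Only the indices are Atomic; the elements are merely volatile:]

```ada
--  Hypothetical single-producer/single-consumer declarations (Ada 2012
--  aspect syntax).  The atomic indices protect the volatile element array.
package SPSC_Buffer is
   type Index_Type is mod 16;                --  modular type wraps by itself
   type Element_Array is array (Index_Type) of Integer
     with Volatile_Components;

   Buffer    : Element_Array;
   Put_Index : Index_Type := 0 with Atomic;  --  fences implied on access
   Get_Index : Index_Type := 0 with Atomic;
end SPSC_Buffer;
```

[Accesses to Put_Index and Get_Index are then "sequential" in the 9.10 sense, 
and the element slots they guard can be trusted across tasks.]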

                                Randy.

P.S. Yes, we've discussed this in the ARG. We believe the RM description has 
the right effect, but the issues and wording are both hairy, so we could be 
wrong.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-08 16:22       ` Dmitry A. Kazakov
                           ` (3 preceding siblings ...)
  2017-05-09  0:05         ` Jere
@ 2017-05-09 19:53         ` Randy Brukardt
  2017-05-09 20:27           ` Dmitry A. Kazakov
  4 siblings, 1 reply; 65+ messages in thread
From: Randy Brukardt @ 2017-05-09 19:53 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:oeq60s$18c1$1@gioia.aioe.org...
> On 2017-05-08 17:56, Robert Eachus wrote:
>> One sentence summary: The CPU will play lots of tricks behind your
>> back, but in such a way you can pretend your code works as you intended.
>> (Er, after you get the bugs out.)
>>
>> On Monday, May 8, 2017 at 3:45:26 AM UTC-4, Dmitry A. Kazakov wrote:
>>
>>> I don't believe either of the sequences can get reordered by the CPU,
>>> but it is only belief and common sense.
>>
>> Silly rabbit, Trix are for kids. Oops, silly programmer, common
>> sense  has nothing to do with modern CPUs.
> [...]
>> Is any of this important to a high-level language programmer? Not
>> really, unless they are working in hard real time. (Which I did.) If you
>> program in assembler, or more likely work on a compiler back-end,
>> knowing which instructions fit together is important. Even more
>> important is coding so that an instruction that references a new cache
>> line in memory is followed by as many instructions as possible that
>> don't use that value. Some times you run into cases where you want to
>> take advantage of write combining or some other trick involving the
>> write pipe--but today, smart compiler optimization is about accesses to
>> main memory and not much else.
>
> I don't see anything there that would prevent the CPU from changing the 
> order of accesses to two *independent* memory locations when only the 
> program logic ties them together and nothing else. Well, except that there 
> is no obvious reason why it would change that order.
>
> Wikipedia provides a more simple example of the problem:
>
>    https://en.wikipedia.org/wiki/Memory_barrier
>
> The case of a lock-free FIFO is different because it involves index and 
> array element.
>
> For practical use when the buffer size is > 1 there cannot be any problem.
>
> Yet the question stands. In particular, which Ada primitives may ensure 
> safe implementations of lock-free structures when no protected objects 
> used.

Atomic objects make such a guarantee. Nothing else does (well, ignoring 
protected objects, of course, since you excluded them). End of story. 
(Except that compilers might not implement it right, as I noted before.)

                             Randy.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-08 23:24             ` Robert Eachus
  2017-05-09  0:30               ` Jere
@ 2017-05-09 19:57               ` Randy Brukardt
  2017-05-10  0:51                 ` Jere
  2017-05-10 16:19                 ` Robert Eachus
  1 sibling, 2 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-09 19:57 UTC (permalink / raw)


"Robert Eachus" <rieachus@comcast.net> wrote in message 
news:077e7f6a-5a7b-4b88-a16f-7672aec18a17@googlegroups.com...
...
>The compiler will play some games, and certainly will leave Write_Index in 
>a register if possible.

It better not if Write_Index is Atomic.

Which is my point on this discussion: what the CPU does is irrelevant if the 
compiler is implemented correctly. If it isn't, then the compiler is wrong. 
QED.

Only the compiler vendor need worry about this level of discussion. 
Honestly, I doubt anyone else is qualified. (I'm not, and I AM a compiler 
vendor... ;-)

                                     Randy.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09  0:30               ` Jere
  2017-05-09  4:02                 ` Robert Eachus
  2017-05-09  4:32                 ` Robert Eachus
@ 2017-05-09 20:01                 ` Randy Brukardt
  2 siblings, 0 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-09 20:01 UTC (permalink / raw)



"Jere" <jhb.chat@gmail.com> wrote in message 
news:8a968aae-79e4-421b-ba4b-e0a9a33ce0db@googlegroups.com...
...
>I am still digesting some of this, so forgive the simple question:

Don't bother, it's irrelevant if you trust your compiler vendor. If you 
don't, then you're out of luck on modern CPUs because mere mortals cannot 
understand them.

>What prevents the last 3 lines from being reordered like this:
>r1 := r1 + 1;
>(Write_Index'Address) := r1;
>(r3) := r4;

Assuming again that Write_Index is atomic, as is necessary to make any 
guarantees, this translation is 100% wrong (can't put Write_Index into a 
register), so reordering is irrelevant. If the CPU is going to do it anyway, 
then the compiler has to do whatever the CPU vendor provided to prevent it.

If Write_Index isn't atomic, then you can make no assumptions about how the 
code is executed and hopefully you aren't (it is purely sequential code).

                              Randy.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-08 20:29         ` Niklas Holsti
  2017-05-08 21:09           ` Dmitry A. Kazakov
@ 2017-05-09 20:10           ` Randy Brukardt
  1 sibling, 0 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-09 20:10 UTC (permalink / raw)


"Niklas Holsti" <niklas.holsti@tidorum.invalid> wrote in message 
news:enc2orF3oloU1@mid.individual.net...
...
>> Yet the question stands. In particular, which Ada primitives may ensure
>> safe implementations of lock-free structures when no protected objects
>> used.
>
> Pragma Volatile + RM C.6 (16/3)? If these are not sufficient, why not?

It's very subtle, but you need something atomic involved, else the rule for 
volatile is meaningless. It took a long discussion by a (now former) AdaCore 
person for me to understand it. Essentially, all shared object accesses 
(that is, from different tasks) are erroneous unless something makes the 
accesses "sequential" as defined in 9.10. Atomic makes accesses 
"sequential" (thus requiring "fences" on some CPUs), volatile does not.

The second sentence of C.6(16/3) is trying to point this out.

Since not everything can be atomic, you usually have to use some sort of 
mixed atomic and volatile coding.

Again, as I've warned multiple times, this is untestable by the ACATS, so 
you have to go totally on trust that the vendor has implemented it right.

Eachus's lengthy discussion might be interesting if you are implementing an 
Ada compiler, but otherwise, just use the Ada tools provided and trust that 
your vendor gets them right. There's really no other choice because the odds 
of your getting it right when the vendor didn't is very low (as this stuff 
is at the edge of human understanding).

                                   Randy.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09  6:16               ` Niklas Holsti
  2017-05-09  8:34                 ` Dmitry A. Kazakov
@ 2017-05-09 20:18                 ` Randy Brukardt
  1 sibling, 0 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-09 20:18 UTC (permalink / raw)


"Niklas Holsti" <niklas.holsti@tidorum.invalid> wrote in message 
news:end561F9mocU1@mid.individual.net...
...
> It would be nice to hear from some ARG member re the correct 
> interpretation of RM C.6 (16/3), as this seems to define whether pragma 
> Volatile is enough to let us implement lock-free synchronisation routines.

(You shouldn't have these discussions while I'm on vacation. ;-)

No.

You miss the point about "non-erroneous programs" in the AARM note. Almost 
all accesses to shared variables by multiple tasks are erroneous, unless 
something makes them sequential. That something can be a rendezvous, a 
protected action, or an atomic object read/write. Without one of these, 
volatile guarantees nothing.

AARM C.6(16.c/3) was my attempt to explain the difference between volatile 
and non-volatile, since in most instances there isn't one. (The main answer 
is that a volatile object can be trusted in another task after an access to 
an atomic object, and a regular object cannot; otherwise they're identical.)
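
[Editorial sketch of that reading; not code from the thread, all names 
invented. The consumer reads the atomic index first, after which the 
volatile slot it guards can be trusted:]

```ada
--  Hypothetical consumer side of a lock-free FIFO (Ada 2012).
--  The atomic index reads/writes act as the synchronization points;
--  the volatile element read between them can then be trusted.
package Consumer_Side is
   type Index_Type is mod 16;
   type Element_Array is array (Index_Type) of Integer
     with Volatile_Components;
   Buffer    : Element_Array;
   Put_Index : Index_Type := 0 with Atomic;
   Get_Index : Index_Type := 0 with Atomic;

   function Get (Item : out Integer) return Boolean;
end Consumer_Side;

package body Consumer_Side is
   function Get (Item : out Integer) return Boolean is
   begin
      if Get_Index = Put_Index then     --  atomic reads: empty check
         return False;
      end if;
      Item := Buffer (Get_Index);       --  volatile read, now trustworthy
      Get_Index := Get_Index + 1;       --  atomic write frees the slot
      return True;
   end Get;
end Consumer_Side;
```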

AARM C.6(16.d/3) answers the OPs question: uses of an atomic object implies 
barriers/fences if one is needed.

Note that volatile has a second, separate meaning that should never have been 
mixed in to the tasking one (that is, access to memory-mapped 
hardware/software). That has different requirements at this sort of edge. 
(There's a proposal on the table to split volatile into parts to clarify 
which kind of use is intended.)

                                    Randy.




^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09 19:53         ` Randy Brukardt
@ 2017-05-09 20:27           ` Dmitry A. Kazakov
  2017-05-11  0:35             ` Randy Brukardt
  0 siblings, 1 reply; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-09 20:27 UTC (permalink / raw)


On 2017-05-09 21:53, Randy Brukardt wrote:
> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
>> Yet the question stands. In particular, which Ada primitives may ensure
>> safe implementations of lock-free structures when no protected objects
>> used.
>
> Atomic objects make such a guarantee. Nothing else does (well, ignoring
> protected objects, of course, since you excluded them). End of story.

Glad to hear that!

> (Except that compilers might not implement it right, as I noted before.)

So far I haven't experienced problems with my implementations of lock-free 
FIFO and blackboard compiled with GNAT for x86 and ARM7.

You are right to note that this is not testable (OK, one could fully test 
it under an emulator in simulated time, but who tests the emulator?).

Yet, surprisingly, on real examples the error shows itself pretty 
quickly. The background is that, since GNAT does not support pragma 
Atomic for 64-bit integers on 32-bit machines, I have the implementation 
vary, selected by a GNAT project scenario. E.g. GCC built-in operations 
are used instead of pragma Atomic. When the selected scenario is wrong, 
the error shows itself almost instantly.

So the conclusion is that GNAT seems OK.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09 19:49     ` Randy Brukardt
@ 2017-05-09 22:07       ` Jere
  2017-05-11  1:14         ` Randy Brukardt
  2017-05-10 18:28       ` Shark8
  1 sibling, 1 reply; 65+ messages in thread
From: Jere @ 2017-05-09 22:07 UTC (permalink / raw)


On Tuesday, May 9, 2017 at 3:49:29 PM UTC-4, Randy Brukardt wrote:
> "Jere" wrote in message 
> ...
> > Thanks for the response!  I'm good with all that.  I'm more worried about
> > situations like these:
> >
> > Task A:
> > -- Do some checks to make sure buffer not full
> >
> > -- Not full, so put data
> > My_Array(Put_Index) := Some_Value;
> > if Put_Index = Buffer_Size then
> >   Put_Index := 0;
> > else
> >   Put_Index := Put_Index + 1;
> > end if;
> >
> > Task B:
> > if Put_Index = Get_Index then -- empty
> >   return False;
> > end if;
> >
> > Result_Value := My_Array(Get_Index);
> > -- Do other operations that update Get_Index
> >
> > Assuming the indexes are atomic and volatile, then I am not worried about
> > partial writes to the indexes.  I'm more worried about some architectures 
> > that
> > allow for instruction reordering.
> 
> Atomic implies sequential access for atomic AND volatile objects (note that 
> volatile does NOT imply that). See 9.10 if you dare. (And if you understand 
> 9.10, please join the ARG, we need you. ;-)
> 

Thanks for all the replies.  That is interesting. I had assumed Ada's definition 
was similar to C or C++, but indeed it is different.  The C and C++ definitions 
of volatile guarantee that the compiler will not reorder load/store operations 
of volatile variables.  From this statement it sounds like Ada makes no such 
guarantee with Volatile.  NOTE: The C and C++ definitions hold no guarantee 
versus CPU instruction reordering like Ada does with pragma Atomic.

I got a test going today and tried it on two different GNAT implementations on 
two different processors.  I found that the GPL version of GNAT (gcc 4.9.3) 
works as you state.  Atomic prevents changes in instruction sequence. However, 
my older version (FSF GCC 4.1.2) does not.  Atomic had no effect on sequence 
(from a CPU perspective at least).  I am guessing that Atomic's definition was 
weaker back then.

Again, thanks for this; it helps.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09  4:32                 ` Robert Eachus
  2017-05-09  4:44                   ` Robert Eachus
@ 2017-05-09 22:26                   ` Jere
  1 sibling, 0 replies; 65+ messages in thread
From: Jere @ 2017-05-09 22:26 UTC (permalink / raw)


On Tuesday, May 9, 2017 at 12:32:38 AM UTC-4, Robert Eachus wrote:
> > I am still digesting some of this, so forgive the simple question:
> > 
> > What prevents the last 3 lines from being reordered like this:
> > r1 := r1 + 1;
> > (Write_Index'Address) := r1;
> > (r3) := r4;
> 
> See above.  Writes cannot be moved past writes.  In practice the write pipe does some magic combining writes to the same cache line, etc.  But it maintains the order on writes to cache.  If you have the right equipment you can peek and see that the writes to main memory are misordered, or even missing.  But without a signal analyzer or some such, you can't peek at main memory bypassing the cache. 

Again, thanks for the responses.

I wanted to get back to this.  I wanted to do some testing, so I generated an 
example of two tasks that each update an index and then copy another (2 stores 
and a load).  I ran it on two different intel processors (all I had to test 
with): i7-4702MQ and i7-6700.  In both cases, there was CPU instruction 
reordering on the two stores (generated assembly had them in the correct order).  
On the first processor I was running GCC 4.1.2 and pragma Atomic was not working 
as Randy described, so I had to insert a makeshift fence (I declared a protected 
object), which prevented the reordering.  On the second processor I was running 
GCC 4.9.3 which properly inserted synchronization primitives to prevent CPU 
instruction reordering, but if I enabled the GNAT pragma to turn off those 
primitives, it operated like the GCC 4.1.2 version.

I'll put the test code below.  I hope it illustrates the issue I was trying
to describe earlier.

Test_Reorder.ads
********************************************************************************
package Test_Reorder is
   procedure Run;
end Test_Reorder;


Test_Reorder.adb
********************************************************************************
pragma Ada_95;

with Ada.Numerics.Discrete_Random;
with Ada.Text_IO;

package body Test_Reorder is
   
   X      : Integer := 0; 
   Y      : Integer := 0; 
   X_Copy : Integer := 0; 
   Y_Copy : Integer := 0; 
   
   -- In GCC 4.9.3, these pragmas are sufficient
   -- to prevent reordering.  Not in GCC 4.1.2 though
   pragma Atomic(X);
   pragma Atomic(Y);     
   pragma Atomic(X_Copy);
   pragma Atomic(Y_Copy);
   
   -- Uncomment these to get CPU instruction reordering in 4.9.3
--     pragma Disable_Atomic_Synchronization(X);
--     pragma Disable_Atomic_Synchronization(Y);
--     pragma Disable_Atomic_Synchronization(X_Copy);
--     pragma Disable_Atomic_Synchronization(Y_Copy);
   
   package Positive_Random is new Ada.Numerics.Discrete_Random(Positive);
   
   Generator : Positive_Random.Generator;
   
   function Random_Number return Natural is
   begin
      return Positive_Random.Random(Generator) mod 8;
   end Random_Number;
   
   task Task1 is
      entry Start;
      entry Stop;
   end Task1;
   
   task Task2 is
      entry Start;
      entry Stop;
   end Task2;
   
   task body Task1 is
   begin
      loop
         accept Start;
         -- Delay a random amount of time
         while Random_Number /= 0 loop
            null;
         end loop;
         
         X := 1;  -- In GCC 4.1.2 I need to insert a fence after this
         Y_Copy := Y;
         
         accept Stop;         
      end loop;
   end Task1;
   
   task body Task2 is
   begin
      loop
         accept Start;
         -- Delay a random amount of time
         while Random_Number /= 0 loop
            null;
         end loop;
         
         Y := 1;   -- In GCC 4.1.2 I need to insert a fence after this
         X_Copy := X;
         
         accept Stop;         
      end loop;
   end Task2;
   
   procedure Run is 
      Detections : Natural := 0;
      Iterations : Natural := 0;
   begin
      Positive_Random.Reset(Generator);
      
      loop
         X := 0;
         Y := 0;
         Task1.Start;
         Task2.Start;
         
         Task1.Stop;
         Task2.Stop;
         
         Iterations := Iterations + 1;

         -- If both are 0, then at least one got reordered
         if X_Copy = 0 and Y_Copy = 0 then
            Detections := Detections + 1;
            Ada.Text_IO.Put_Line
               (Integer'Image(Detections)
                & " reorders found in "
                & Integer'Image(Iterations)
                & " iterations");
         end if;
         
      end loop;
   end Run;
end Test_Reorder;

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09 19:57               ` Randy Brukardt
@ 2017-05-10  0:51                 ` Jere
  2017-05-10  5:25                   ` J-P. Rosen
                                     ` (2 more replies)
  2017-05-10 16:19                 ` Robert Eachus
  1 sibling, 3 replies; 65+ messages in thread
From: Jere @ 2017-05-10  0:51 UTC (permalink / raw)


On Tuesday, May 9, 2017 at 3:57:50 PM UTC-4, Randy Brukardt wrote:
> "Robert Eachus" wrote in message 
> ...
> >The compiler will play some games, and certainly will leave Write_Index in 
> >a register if possible.
> 
> It better not if Write_Index is Atomic.
> 
> Which is my point on this discussion: what the CPU does is irrelevant if the 
> compiler is implemented correctly. If it isn't, then the compiler is wrong. 
> QED.
> 
> Only the compiler vendor need worry about this level of discussion. 
> Honestly, I doubt anyone else is qualified. (I'm not, and I AM a compiler 
> vendor... ;-)
> 
>                                      Randy.

Is there a method besides Atomic?  If I am implementing a generic FIFO (lock 
free) and the FIFO elements are complex data types that may not be able to be 
atomic, do I have any other options or are protected objects my only way out?

As an example, if my FIFO was an array of some records that contained a 64 bit 
integer?  I know I can make the FIFO volatile, but as you indicated, that isn't 
enough.  I can't make it Atomic either.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-10  0:51                 ` Jere
@ 2017-05-10  5:25                   ` J-P. Rosen
  2017-05-10 22:56                     ` Jere
  2017-05-10  7:13                   ` Dmitry A. Kazakov
  2017-05-10 16:38                   ` Jeffrey R. Carter
  2 siblings, 1 reply; 65+ messages in thread
From: J-P. Rosen @ 2017-05-10  5:25 UTC (permalink / raw)


Le 10/05/2017 à 02:51, Jere a écrit :
> Is there a method besides Atomic?  If I am implementing a generic FIFO (lock 
> free) and the FIFO elements are complex data types that may not be able to be 
> atomic, do I have any other options or are protected objects my only way out?

Protected objects, or for a lower level, Synchronous_Task_Control. I
know that you want lock free algorithms, but out of curiosity, did you
measure the cost of locking algorithms? IOW, is all of this worth the
trouble?

-- 
J-P. Rosen
Adalog
2 rue du Docteur Lombard, 92441 Issy-les-Moulineaux CEDEX
Tel: +33 1 45 29 21 52, Fax: +33 1 45 29 25 00
http://www.adalog.fr

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-10  0:51                 ` Jere
  2017-05-10  5:25                   ` J-P. Rosen
@ 2017-05-10  7:13                   ` Dmitry A. Kazakov
  2017-05-10 16:45                     ` Robert Eachus
                                       ` (3 more replies)
  2017-05-10 16:38                   ` Jeffrey R. Carter
  2 siblings, 4 replies; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-10  7:13 UTC (permalink / raw)


On 10/05/2017 02:51, Jere wrote:

> Is there a method besides Atomic?  If I am implementing a generic FIFO (lock
> free) and the FIFO elements are complex data types that may not be able to be
> atomic, do I have any other options or are protected objects my only way out?

But you don't need the elements to be atomic; only the index does. If I 
understood Randy's explanation correctly, atomic access to the buffer 
index will give you the barrier between index and element accesses, which 
is all that is required.
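
[Editorial sketch of that scheme; not code from the thread, all names 
invented. The volatile element write is ordered before the atomic index 
update by the fence the atomic store implies:]

```ada
--  Hypothetical producer side of a lock-free FIFO (Ada 2012).
--  One slot is kept free to distinguish full from empty.
package Producer_Side is
   type Index_Type is mod 16;
   type Element_Array is array (Index_Type) of Integer
     with Volatile_Components;
   Buffer    : Element_Array;
   Put_Index : Index_Type := 0 with Atomic;
   Get_Index : Index_Type := 0 with Atomic;

   procedure Put (Item : Integer; Done : out Boolean);
end Producer_Side;

package body Producer_Side is
   procedure Put (Item : Integer; Done : out Boolean) is
      Next : constant Index_Type := Put_Index + 1;  --  mod type wraps
   begin
      if Next = Get_Index then        --  full: one slot kept free
         Done := False;
      else
         Buffer (Put_Index) := Item;  --  volatile element write first
         Put_Index := Next;           --  atomic store publishes the slot
         Done := True;
      end if;
   end Put;
end Producer_Side;
```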

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09 19:57               ` Randy Brukardt
  2017-05-10  0:51                 ` Jere
@ 2017-05-10 16:19                 ` Robert Eachus
  2017-05-11  1:02                   ` Randy Brukardt
  1 sibling, 1 reply; 65+ messages in thread
From: Robert Eachus @ 2017-05-10 16:19 UTC (permalink / raw)


On Tuesday, May 9, 2017 at 3:57:50 PM UTC-4, Randy Brukardt wrote:
> "Robert Eachus" <rieachus@comcast.net> wrote in message 
> news:077e7f6a-5a7b-4b88-a16f-7672aec18a17@googlegroups.com...
> ...
> >The compiler will play some games, and certainly will leave Write_Index in 
> >a register if possible.
> 
> It better not if Write_Index is Atomic.

First, if FIFO has Atomic_Components, then it is not possible.  Atomic_Components would be overkill, but I haven't seen those declarations here.  If FIFO has just Volatile_Components (or no attribute) then this is possible.  But what is "this"?
> 
Sloppy wording, I should have said "keep a copy of Write_Index".  The write to Write_Index will occur to cache on modern systems where the whole system is cache-coherent.  But that is a detail.  What I was specifically referring to was that there are two subexpressions Write_Index+1.  As far as I can tell, if there are no synchronization points between them, the (overall system including) the compiler can do one read of Write_Index, then evaluate the expression once--or twice.

Why evaluate twice? (Other than the write to FIFO being a synchronization point?) Hardware magic may combine the increment with the move, through a postincrement addressing mode. It would then write the incremented register copy to Write_Index.  One read, one write.  Same as if it did common subexpression elimination. It doesn't need to READ Write_Index twice.  I hope I'm right, because anything else is silly.  If a synchronization point did occur between the two reads, then the in-register copy of Write_Index, or Write_Index + 1, if the implementation wants to keep that around instead, must be spoiled.

What if Write_Index is assigned to a special address which other hardware can access?  Doesn't matter.  Think of the cache system as treating addresses as names, and moving them around.  As long as everyone sees the same version of that name, the contents of the location named are the only reality we can be concerned with.  Since CPUs now contain the (physical) memory managers, as well as cache, the "external effect" of a program has to be interpreted as the view of memory offered by the CPU.  Nothing else is available.  (Writes to disk, or to the display screen are all memory writes from the CPU's viewpoint.)

There is another issue here, which we need to allow compilers to get right.  Assume that the records in FIFO are X bytes long.  A common compiler optimization is to use a value FIFO'Address + X * Write_Index instead of Write_Index to do the writes.  Is that legal?  Or better, under which attributes of FIFO and Write_Index is the system allowed to do that optimization?

> Only the compiler vendor need worry about this level of discussion. 
> Honestly, I doubt anyone else is qualified. (I'm not, and I AM a compiler 
> vendor... ;-)

I used to be a compiler writer, on at least three Ada compilers, but the deep-level understanding of what is going on comes from my work on radar systems.  There, access times in microseconds are important, and sometimes in nanoseconds. (I think the worst real-time requirement I ever saw was for 200 nanoseconds. Ouch!)


 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-10  0:51                 ` Jere
  2017-05-10  5:25                   ` J-P. Rosen
  2017-05-10  7:13                   ` Dmitry A. Kazakov
@ 2017-05-10 16:38                   ` Jeffrey R. Carter
  2017-05-10 23:40                     ` Jere
  2 siblings, 1 reply; 65+ messages in thread
From: Jeffrey R. Carter @ 2017-05-10 16:38 UTC (permalink / raw)


On 05/10/2017 02:51 AM, Jere wrote:
>
> Is there a method besides Atomic?  If I am implementing a generic FIFO (lock
> free) and the FIFO elements are complex data types that may not be able to be
> atomic, do I have any other options or are protected objects my only way out?

If your indices are Atomic and your array has Volatile_Components, then I think 
things will work the way you want them to. The important reference here is ARM 
1.1.3, which says

* Any read or update of an atomic or volatile object is part of the program's 
external interactions.

* An Ada compiler must produce a program with external interactions in the order 
and timing specified by Ada's semantics.

Ada's semantics for a sequence of statements say the statements are executed in 
order.

So, if you have in task A

A (I + 1) := Item;
I := I + 1;

Then from the outside an observer must see in order

1. Read I
2. Write A (I + 1)
3. Read I
4. Write I

If task B does

if I > 0 then

then B's "Read I" will also be something the external observer can see. It might 
happen before 1., between 1. and 3., between 3. and 4., or after 4., but won't 
happen during any of 1, 3, or 4. That guarantees that B will always get a valid 
view of I.

Thanks to Holsti for pointing this out [through ARM C.6(20)]. My previous 
understanding of this was incorrect.
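A minimal sketch of the pattern described above, with hypothetical names (single producer, single consumer; a power-of-two capacity is assumed so the modular indices wrap naturally):

```ada
--  Hypothetical single-producer/single-consumer FIFO sketch: Atomic
--  indices plus an array with Volatile_Components, as discussed above.
package SPSC_Fifo is
   type Index is mod 16;               --  power-of-two capacity, wraps
   type Element is new Integer;

   procedure Put (Item : Element);                      --  producer only
   procedure Get (Item : out Element; Ok : out Boolean); --  consumer only
end SPSC_Fifo;

package body SPSC_Fifo is
   Buffer : array (Index) of Element with Volatile_Components;
   Write_Index : Index := 0 with Atomic;  --  written only by producer
   Read_Index  : Index := 0 with Atomic;  --  written only by consumer

   procedure Put (Item : Element) is
   begin
      Buffer (Write_Index) := Item;    --  volatile write of the data first
      Write_Index := Write_Index + 1;  --  then the atomic index update
   end Put;

   procedure Get (Item : out Element; Ok : out Boolean) is
   begin
      Ok := Read_Index /= Write_Index;
      if Ok then
         Item := Buffer (Read_Index);
         Read_Index := Read_Index + 1;
      end if;
   end Get;
end SPSC_Fifo;
```

Note that this naive sketch never checks for a full buffer in Put; a real implementation would, for instance, leave one slot unused and refuse (or spin) when Write_Index + 1 = Read_Index.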

-- 
Jeff Carter
"Monsieur Arthur King, who has the brain of a duck, you know."
Monty Python & the Holy Grail
09


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-10  7:13                   ` Dmitry A. Kazakov
@ 2017-05-10 16:45                     ` Robert Eachus
  2017-05-10 17:28                       ` Simon Wright
  2017-05-10 23:21                     ` Jere
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 65+ messages in thread
From: Robert Eachus @ 2017-05-10 16:45 UTC (permalink / raw)


On Wednesday, May 10, 2017 at 3:13:27 AM UTC-4, Dmitry A. Kazakov wrote:

> But you don't need elements atomic only index has to. If I understood 
> Randy's explanation correctly atomic access to the buffer index will 
> give you the barrier between index and element accesses which is all 
> that is required.

That is my reading as well, and should result in efficient implementations.  Also, for portable code, there are times when Volatile is required.  But you really need ancient (in Internet years) hardware for Volatile to have any effect at all. (At least in non-erroneous programs.)  In general, Atomic should be implemented as a RMW (read-modify-write) instruction on x86, and will add fences on any hardware that requires them.

The other part, which occurs in this code, is that (user) Atomic variables should only be written by one task/thread.  Hmm.  Compiler back-ends should probably insert a RMW of a non-program variable to cause synchronization where an Atomic variable is only being read.  Of course, this would eliminate only most erroneous executions, not all of them.  Best is for programmers to use protected objects, which can correctly (in an interrupt-protected way) use RMWs for synchronization.  

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-10 16:45                     ` Robert Eachus
@ 2017-05-10 17:28                       ` Simon Wright
  0 siblings, 0 replies; 65+ messages in thread
From: Simon Wright @ 2017-05-10 17:28 UTC (permalink / raw)


Robert Eachus <rieachus@comcast.net> writes:

> But you really need ancient (in Internet years) hardware for Volatile
> to have any effect at all.

I think Randy recently pointed out that Volatile should ideally apply
only to the memory-mapped i/o case; for the external effects.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09 19:49     ` Randy Brukardt
  2017-05-09 22:07       ` Jere
@ 2017-05-10 18:28       ` Shark8
  2017-05-11  1:17         ` Randy Brukardt
  1 sibling, 1 reply; 65+ messages in thread
From: Shark8 @ 2017-05-10 18:28 UTC (permalink / raw)


On Tuesday, May 9, 2017 at 1:49:29 PM UTC-6, Randy Brukardt wrote:
> 
> Atomic implies sequential access for atomic AND volatile objects (note that 
> volatile does NOT imply that). See 9.10 if you dare. (And if you understand 
> 9.10, please join the ARG, we need you. ;-)

I understood 9.10 -- how do I join the ARG?


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-10  5:25                   ` J-P. Rosen
@ 2017-05-10 22:56                     ` Jere
  2017-05-11  7:36                       ` Dmitry A. Kazakov
  0 siblings, 1 reply; 65+ messages in thread
From: Jere @ 2017-05-10 22:56 UTC (permalink / raw)


On Wednesday, May 10, 2017 at 1:26:00 AM UTC-4, J-P. Rosen wrote:
> Le 10/05/2017 à 02:51, Jere a écrit :
> > Is there a method besides Atomic?  If I am implementing a generic FIFO (lock 
> > free) and the FIFO elements are complex data types that may not be able to be 
> > atomic, do I have any other options or are protected objects my only way out?
> 
> Protected objects, or for a lower level, Synchronous_Task_Control. I
> know that you want lock free algorithms, but out of curiosity, did you
> measure the cost of locking algorithms? IOW, is all of this worth the
> trouble?
I wasn't necessarily aiming for a specific speed/cost.  I know I can easily 
find a standard task safe container and use that and it might be faster (or at 
least fast enough) on various platforms.  This is more for a different option 
from the normal.  It might be faster on some platforms or slower, but it is 
mostly meant to be a tool in the toolkit.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-10  7:13                   ` Dmitry A. Kazakov
  2017-05-10 16:45                     ` Robert Eachus
@ 2017-05-10 23:21                     ` Jere
  2017-05-11  0:47                       ` Randy Brukardt
  2017-05-10 23:30                     ` Jere
  2017-05-11  0:38                     ` Randy Brukardt
  3 siblings, 1 reply; 65+ messages in thread
From: Jere @ 2017-05-10 23:21 UTC (permalink / raw)


On Wednesday, May 10, 2017 at 3:13:27 AM UTC-4, Dmitry A. Kazakov wrote:
> On 10/05/2017 02:51, Jere wrote:
> 
> > Is there a method besides Atomic?  If I am implementing a generic FIFO (lock
> > free) and the FIFO elements are complex data types that may not be able to be
> > atomic, do I have any other options or are protected objects my only way out?
> 
> But you don't need elements atomic only index has to. If I understood 
> Randy's explanation correctly atomic access to the buffer index will 
> give you the barrier between index and element accesses which is all 
> that is required.

I may have misunderstood him, but from Randy's response, I gathered that 
Volatile has no effect on sequence; only Atomic does.  My understanding is that 
there are two parts to sequencing:  compiler reordering and CPU reordering.  From 
Randy's response I gathered that Atomic guarantees no CPU reordering between it 
and volatile objects (using fences or whatever mechanism is provided).  If Volatile 
has no effect on compiler reordering, then even if Atomic can put 
fences between it and Volatile objects, the fence won't help if the compiler 
reorders before that happens.  So I know that the index operations 
are safe, but I don't know whether any assignment before or after the indexing operation is OK, since the buffer is only volatile. 

In particular, I am basing it off of this response from Randy:
***************************************************
(partial quote)
Atomic implies sequential access for atomic AND volatile objects (note that
volatile does NOT imply that). See 9.10 if you dare. (And if you understand
9.10, please join the ARG, we need you. ;-)

In particular, Atomic implies that any needed memory fences are used.
(Again, volatile alone does not make such a requirement.) The idea is that
one uses atomic objects to protect larger volatile data structures.  
***************************************************

So even if Atomic does use fences, if the compiler decides to reorder the 
statements before the binary is created, it may not help.

Did I misunderstand his response?  I might have.

It has been my own experience that GNAT will prevent "compiler" reordering of 
operations on Volatile objects, but that is probably just a GNAT extension.  

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-10  7:13                   ` Dmitry A. Kazakov
  2017-05-10 16:45                     ` Robert Eachus
  2017-05-10 23:21                     ` Jere
@ 2017-05-10 23:30                     ` Jere
  2017-05-11  0:38                     ` Randy Brukardt
  3 siblings, 0 replies; 65+ messages in thread
From: Jere @ 2017-05-10 23:30 UTC (permalink / raw)


On Wednesday, May 10, 2017 at 3:13:27 AM UTC-4, Dmitry A. Kazakov wrote:
> On 10/05/2017 02:51, Jere wrote:
> 
> > Is there a method besides Atomic?  If I am implementing a generic FIFO (lock
> > free) and the FIFO elements are complex data types that may not be able to be
> > atomic, do I have any other options or are protected objects my only way out?
> 
> But you don't need elements atomic only index has to. If I understood 
> Randy's explanation correctly atomic access to the buffer index will 
> give you the barrier between index and element accesses which is all 
> that is required.

Followup to my previous email:
Perhaps it is easier for me to explain what I think is happening:

Key
Type => Prevents Compiler Reordering => Prevents CPU Reordering

After reading the various responses I am under the impression that:
Volatile => NO  => NO
Atomic   => YES => YES

I had hoped it was 
Volatile => YES => NO
Atomic   => YES => YES


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-10 16:38                   ` Jeffrey R. Carter
@ 2017-05-10 23:40                     ` Jere
  0 siblings, 0 replies; 65+ messages in thread
From: Jere @ 2017-05-10 23:40 UTC (permalink / raw)


On Wednesday, May 10, 2017 at 12:38:54 PM UTC-4, Jeffrey R. Carter wrote:
> On 05/10/2017 02:51 AM, Jere wrote:
> >
> > Is there a method besides Atomic?  If I am implementing a generic FIFO (lock
> > free) and the FIFO elements are complex data types that may not be able to be
> > atomic, do I have any other options or are protected objects my only way out?
> 
> If your indices are Atomic and your array has Volatile_Components, then I think 
> things will work the way you want them to. The important reference here is ARM 
> 1.1.3, which says
> 
> * Any read or update of an atomic or volatile object is part of the program's 
> external interactions.
> 
> * An Ada compiler must produce a program with external interactions in the order 
> and timing specified by Ada's semantics.
> 
> Ada's semantics for a sequence of statements say the statements are executed in 
> order.
> 
> So, if you have in task A
> 
> A (I + 1) := Item;
> I := I + 1;
> 
> Then from the outside an observer must see in order
> 
> 1. Read I
> 2. Write A (I + 1)
> 3. Read I
> 4. Write I
> 
> If task B does
> 
> if I > 0 then
> 
> then B's "Read I" will also be something the external observer can see. It might 
> happen before 1., between 1. and 3., between 3. and 4., or after 4., but won't 
> happen during any of 1, 3, or 4. That guarantees that B will always get a valid 
> view of I.
Thanks!

My previous couple of emails might be moot now that I have read this response.  
Here I get the impression that Volatile does prevent compiler statement 
reordering.  If that is the case, then I know that, together with any fencing put 
up by Atomic objects, it will be sufficient.

So at this point my understanding is now:
Type => Prevents Compiler Reordering => Prevents CPU Reordering
Volatile => YES => NO
Atomic   => YES => YES

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09 20:27           ` Dmitry A. Kazakov
@ 2017-05-11  0:35             ` Randy Brukardt
  2017-05-11  8:24               ` Dmitry A. Kazakov
  0 siblings, 1 reply; 65+ messages in thread
From: Randy Brukardt @ 2017-05-11  0:35 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:oet8mj$13ri$1@gioia.aioe.org...
> On 2017-05-09 21:53, Randy Brukardt wrote:
...
>> (Except that compilers might not implement it right, as I noted before.)
>
> So far I didn't experience problems with my implementations of lock-free 
> FIFO and blackboard compiled with GNAT for x86 and ARM7.
>
> You are right noting that this is not testable (OK, one could fully test 
> it under an emulator in simulated time, but who tests the emulator?).
>
> Yet, surprisingly, on real examples the error shows itself pretty quickly. 
> The background is that since GNAT does not support pragma Atomic for 
> 64-bit integers on 32-bit machines I have the implementation varying, 
> selected by GNAT project scenario. E.g. GCC built-in operations are used 
> instead of pragma Atomic. When the selected scenario is wrong the error 
> shows itself almost instantly.

That doesn't surprise me that much. If you have a lock-free algorithm that 
gets used frequently, you're probably going to hit all of the possible 
states reasonably soon (assuming thousands/millions of operations get 
tested). The problem of course is that failure just causes erroneous 
execution, so formally one can't depend on anything that happens -- which of 
course includes displaying a failure message.

Still, it would be nice to try something in this area. One could imagine 
creating an Annex C test that implemented a lock-free algorithm and just 
tried to see if it worked properly. That certainly wouldn't catch 
everything, but it would catch some gross errors and at least put a 
spotlight on the rules in question. But that isn't my area of expertise; 
doing it wrong is definitely worse than not doing it at all (because of time 
investment to issue, then withdraw such a test after it is challenged).

If someone submitted a test in this area (hint, hint!) I definitely would be 
interested.

                                       Randy.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-10  7:13                   ` Dmitry A. Kazakov
                                       ` (2 preceding siblings ...)
  2017-05-10 23:30                     ` Jere
@ 2017-05-11  0:38                     ` Randy Brukardt
  3 siblings, 0 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-11  0:38 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:oeueil$knn$1@gioia.aioe.org...
> On 10/05/2017 02:51, Jere wrote:
>
>> Is there a method besides Atomic?  If I am implementing a generic FIFO 
>> (lock
>> free) and the FIFO elements are complex data types that may not be able 
>> to be
>> atomic, do I have any other options or are protected objects my only way 
>> out?
>
> But you don't need elements atomic only index has to. If I understood 
> Randy's explanation correctly atomic access to the buffer index will give 
> you the barrier between index and element accesses which is all that is 
> required.

Correct. The compiler can't motion (or let the CPU motion) volatile 
reads/writes across an atomic read/write. Thus you can depend on the order 
of volatile operations so long as an atomic operation is interleaved.
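That rule can be pictured with a small hypothetical fragment (names are illustrative, not from the original posts): the volatile write cannot be moved past the atomic write, so a reader that observes the atomic flag also observes the data.

```ada
--  Illustrative sketch of the ordering rule: the volatile write to Data
--  may not be moved (by compiler or CPU) past the atomic write to Flag.
package Ordering_Sketch is
   Data : Integer := 0     with Volatile;
   Flag : Boolean := False with Atomic;

   procedure Publish;
end Ordering_Sketch;

package body Ordering_Sketch is
   procedure Publish is
   begin
      Data := 42;    --  1. volatile write of the payload
      Flag := True;  --  2. atomic write: the ordering point (fence);
                     --     a task that reads Flag = True may then
                     --     safely read Data
   end Publish;
end Ordering_Sketch;
```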

                           Randy.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-10 23:21                     ` Jere
@ 2017-05-11  0:47                       ` Randy Brukardt
  2017-05-13 20:11                         ` Jere
  0 siblings, 1 reply; 65+ messages in thread
From: Randy Brukardt @ 2017-05-11  0:47 UTC (permalink / raw)


"Jere" <jhb.chat@gmail.com> wrote in message 
news:f441cc24-a41e-496e-970f-989a319a2a08@googlegroups.com...
> On Wednesday, May 10, 2017 at 3:13:27 AM UTC-4, Dmitry A. Kazakov wrote:
>> On 10/05/2017 02:51, Jere wrote:
...
> I may have misunderstood him, but from Randy's response, I gathered that
> Volatile has no effect on sequence, only Atomic does.

There are two "sequences" here, one is what the compiler is allowed to 
reorder (which is only non-volatile things, remember that atomic implies 
volatile), and one is what the CPU is allowed to reorder (which is anything 
other than atomic, the compiler has to do whatever is needed to avoid atomic 
reordering).
...
> In particular, I am basing it off of this response from Randy:
> ***************************************************
> (partial quote)
> Atomic implies sequential access for atomic AND volatile objects (note 
> that
> volatile does NOT imply that). See 9.10 if you dare. (And if you 
> understand
> 9.10, please join the ARG, we need you. ;-)
>
> In particular, Atomic implies that any needed memory fences are used.
> (Again, volatile alone does not make such a requirement.) The idea is that
> one uses atomic objects to protect larger volatile data structures.
> ***************************************************
>
> So even if Atomic does use fences, if the compiler decides to reorder the
> statements before the binary is created, it my not help.
>
> Did I misunderstand his response?  I might have.

Yes. I didn't say anything about compiler reordering, because it's crystal 
clear from C.6(20/5) and C.6(16/3) [and the earlier versions, too] that no 
compiler reordering is allowed. [Indeed, C.6(20/5) was too strong, 
preventing writing volatile bit-fields if taken literally (you have to do a 
pre-read to do a bit-field write, but that would clearly be a "memory read 
of a volatile object not specified by the program"), so it had to be 
weakened.]

It's the effect on CPU reordering that isn't very clear; the AARM notes 
C.6(16.a-d/3) are intended to clarify this, but not everyone reads the AARM.

                                         Randy.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-10 16:19                 ` Robert Eachus
@ 2017-05-11  1:02                   ` Randy Brukardt
  2017-05-11  1:51                     ` Robert Eachus
  0 siblings, 1 reply; 65+ messages in thread
From: Randy Brukardt @ 2017-05-11  1:02 UTC (permalink / raw)


"Robert Eachus" <rieachus@comcast.net> wrote in message 
news:df31d3f8-636f-4702-affe-d5a0aa0c898a@googlegroups.com...
On Tuesday, May 9, 2017 at 3:57:50 PM UTC-4, Randy Brukardt wrote:
> "Robert Eachus" <rieachus@comcast.net> wrote in message
> news:077e7f6a-5a7b-4b88-a16f-7672aec18a17@googlegroups.com...
> ...
>> >The compiler will play some games, and certainly will leave Write_Index 
>> >in
>> >a register if possible.
>>
>> It better not is Write_Index is Atomic.
>
>First, if FIFO has Atomic_Components, then it is not possible. 
>Atomic_Components would be
>overkill, but I haven't seen those declarations here.  If FIFO has just 
>Volatile_Components (or
>no attribute) then this is possible.  But what is "this"?

????

Write_Index needs to be atomic in order for the OP's algorithm to work. 
Atomic objects fall under C.6(20/5) -- all reads and updates are included in 
the external effect of the program. Optimizing them out (in any way) is 
wrong (excepting of course a true as-if optimization, but that doesn't make 
any sense in this case). In particular, saving the value in a register is 
useless, because you still have to read the object again at the next 
reference.

>Sloppy wording, I should have said "keep a copy of Write_Index".  The write 
>to Write_Index
>will occur to cache on modern systems where the whole system is 
>cache-coherent.  But that is
>a detail.  What I was specifically referring to was that there are two 
>subexpressions
>Write_Index+1.  As far as I can tell, if there are no synchronization 
>points between them, the
>(overall system including) the compiler can do one read of Write_Index, 
>then evaluate the
>expression once--or twice.

Definitely not (again assuming Write_Index is atomic). One has to make the 
reads of memory for an atomic object; saving copies is wrong. And there is 
no real value in eliminating the +1 -- in-register operations are 
essentially free on modern architectures.

Since the rest of your discussion is based on an incorrect premise, I'm 
going to skip it. (Other than to say, if you need the fastest possible code, 
don't make anything volatile or atomic! They're for synchronization, which 
necessarily is going to be slower than the usual code, since it has to be 
predictable.)

...
>> Only the compiler vendor need worry about this level of discussion.
>> Honestly, I doubt anyone else is qualified. (I'm not, and I AM a compiler
>> vendor... ;-)
>
>I used to be a compiler writer, on at least three Ada compilers, but the 
>deep level
>understanding of what is going on comes from my work on radar systems. 
>There
>access times in microseconds are important, and sometimes in nanoseconds. 
>(I think
>the worst real-time requirement I ever saw was for 200 nanoseconds. Ouch!)

I know that (and so am I), but my complaint was that the discussion isn't 
helpful to the OP's question (which is how to do this in hopefully portable 
Ada); it's the compiler's job to get this right and one hopes that the only 
time anyone needs to worry about this is if they have a bug that they think 
is caused by the compiler doing this wrong. (And this is when AdaCore 
support contracts pay for themselves, because you can dump the problem on 
them and let them figure it out. And it's the sort of problem that makes 
those contracts so expensive, because it is HARD work!!)

                               Randy.





^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-09 22:07       ` Jere
@ 2017-05-11  1:14         ` Randy Brukardt
  0 siblings, 0 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-11  1:14 UTC (permalink / raw)


"Jere" <jhb.chat@gmail.com> wrote in message 
news:bcb9bcd3-34f2-4075-905d-bd1957657dc6@googlegroups.com...
> On Tuesday, May 9, 2017 at 3:49:29 PM UTC-4, Randy Brukardt wrote:
....
> I got a test going today and tried it on two different GNAT 
> implementations on
> two different processors.  I found that the GPL version of GNAT (gcc 
> 4.9.3)
> works as you state.  Atomic prevents changes in instruction sequence. 
> However,
> my older version (FSF GCC 4.1.2) does not.  Atomic had no affect on 
> sequence
> (from a CPU perspective at least).  I am guessing that Atomic's definition 
> was
> weaker back then.

It was never "weaker", but the implications on compilers were not as well 
understood (especially the ones about fences). AI05-0117-1 was the original 
discussion on this topic, it was mostly worked on in 2010. (It originated 
earlier, but all work was in 2010 and 2011.) AI05-0275-1 clarified this 
further in late 2011 and early 2012.

So it's likely that compilers implemented prior to 2010 don't implement any 
fences for atomic objects, mostly likely because the need for them wasn't 
understood. (I personally had no idea...) More recent compilers are much 
more likely to do it right.

Also note that the 2010 AI was classified as an Amendment, so it only 
applies to Ada 2012. I'm not sure why that was (the definition in Ada 2012 
is *weaker* than the one in earlier Ada versions), but it does mean that any 
Ada 95 or Ada 2005 implementations may not deal with this properly.

                                      Randy.





^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-10 18:28       ` Shark8
@ 2017-05-11  1:17         ` Randy Brukardt
  2017-05-11 16:23           ` Jeffrey R. Carter
  0 siblings, 1 reply; 65+ messages in thread
From: Randy Brukardt @ 2017-05-11  1:17 UTC (permalink / raw)


"Shark8" <onewingedshark@gmail.com> wrote in message 
news:bca6700f-5cab-42a6-a055-c7747d0de6b1@googlegroups.com...
> On Tuesday, May 9, 2017 at 1:49:29 PM UTC-6, Randy Brukardt wrote:
>>
>> Atomic implies sequential access for atomic AND volatile objects (note 
>> that
>> volatile does NOT imply that). See 9.10 if you dare. (And if you 
>> understand
>> 9.10, please join the ARG, we need you. ;-)
>
> I understood 9.10 -- how do I join the ARG?

Talk to the chairperson, Jeff Cousins (ask me privately for his e-mail 
address; it's probably public information but I don't feel comfortable in 
posting anyone's address that way).

                               Randy.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-11  1:02                   ` Randy Brukardt
@ 2017-05-11  1:51                     ` Robert Eachus
  2017-05-15 22:45                       ` Randy Brukardt
  0 siblings, 1 reply; 65+ messages in thread
From: Robert Eachus @ 2017-05-11  1:51 UTC (permalink / raw)


On Wednesday, May 10, 2017 at 9:02:22 PM UTC-4, Randy Brukardt wrote:

> Write_Index needs to be atomic in order for the OP's algorithm to work. 
> Atomic objects fall under C.6(20/5) -- all reads and updates are included in 
> the external effect of the program. Optimizing them out (in any way) is 
> wrong (excepting of course a true as-if optimization, but that doesn't make 
> any sense in this case). In particular, saving the value in a register is 
> useless, because you still have to read the object again at the next 
> reference.

Even if there is no intervening read or write of anything?  It seems to me that must be an "as if," since there is no way to prove otherwise.  You may be thinking of reads of memory external to the CPU, part of some other hardware, where the other hardware can intervene.

Ah, the missing attribute:  Uncacheable.  Once upon a time, volatile implied that, but not for many years.  I guess there are still (embedded) processors out there where reading from a (memory mapped I/O) location changes its value, so I guess that is an issue for Ada, but only when running on bare metal.

What about spin-locks?  They have to have another action between reads to work.  Well, you could pack a read, a test, and, say, a jump-if-non-zero all into one clock cycle.  But then there would be an extra read, since the test would be on the prior version of the flags.  Put it into two clocks and the CPU can't be sure that the next read will occur.  In other words, it has to do the right number of reads which can affect flags, and no more.

Hmm. You need to read the (x86 and most other) rules for prefetching memory locations.  As I stated earlier in this thread, any CPU core is allowed to do a prefetch of any location, as long as it does not cause an action like an interrupt.  This is necessary, or you wouldn't be able to have any Atomic accesses. The CPU reads an entire cache line, usually 64, 128, or 256 bytes into the cache, but even the widest of CPU writes are only 64 bytes wide, and those are multiple floating-point values.
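For reference, the spin-lock shape mentioned above can be written portably today; this sketch uses System.Atomic_Operations.Test_And_Set, which Ada 2022 standardized well after this thread, with hypothetical package and subprogram names of my own:

```ada
--  Spin-lock sketch using the Ada 2022 test-and-set package (postdates
--  this thread).  Each loop iteration is a real atomic RMW, so the
--  "right number of reads" concern above is handled by the hardware.
with System.Atomic_Operations.Test_And_Set;

package Spin is
   procedure Acquire;
   procedure Release;
end Spin;

package body Spin is
   use System.Atomic_Operations.Test_And_Set;

   Lock : aliased Test_And_Set_Flag;  --  assumed to start in the clear state

   procedure Acquire is
   begin
      --  Atomic_Test_And_Set returns True while the flag was already set,
      --  i.e. while some other task holds the lock.
      while Atomic_Test_And_Set (Lock) loop
         null;  --  spin
      end loop;
   end Acquire;

   procedure Release is
   begin
      Atomic_Clear (Lock);
   end Release;
end Spin;
```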


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-10 22:56                     ` Jere
@ 2017-05-11  7:36                       ` Dmitry A. Kazakov
  2017-05-13 20:25                         ` Jere
  0 siblings, 1 reply; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-11  7:36 UTC (permalink / raw)


On 11/05/2017 00:56, Jere wrote:
> On Wednesday, May 10, 2017 at 1:26:00 AM UTC-4, J-P. Rosen wrote:
>> Le 10/05/2017 à 02:51, Jere a écrit :
>>> Is there a method besides Atomic?  If I am implementing a generic FIFO (lock
>>> free) and the FIFO elements are complex data types that may not be able to be
>>> atomic, do I have any other options or are protected objects my only way out?
>>
>> Protected objects, or for a lower level, Synchronous_Task_Control. I
>> know that you want lock free algorithms, but out of curiosity, did you
>> measure the cost of locking algorithms? IOW, is all of this worth the
>> trouble?

> I wasn't necessarily aiming for a specific speed/cost.  I know I can easily
> find a standard task safe container and use that and it might be faster (or at
> least fast enough) on various platforms.  This is more for a different option
> from the normal.  It might be faster on some platforms or slower, but it is
> mostly meant to be a tool in toolkit.

One issue is granularity of locking. If a container operation is not 
instant, and that depends on the nature of the elements, doing it in 
the context of a protected action is a bad idea. The solution for 
protected objects is to use a mutex to guard the operation outside the 
protected action. The cost is two protected actions per operation 
instead of one.

A lock-free implementation does not have this issue. Note that whether 
the FIFO index is atomic or in a protected object does not matter. In both 
cases the method is lock-free.

So, Jere, when you become an ARG member (:-)), please push for a requirement 
to support Atomic for all scalar objects. If the machine lacks 
corresponding instructions, the compiler must fall back to a protected 
object. And conversely, if there is a strong case that using a protected 
action might be more efficient than the machine's atomic access 
instructions, the compiler should use the former.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-11  0:35             ` Randy Brukardt
@ 2017-05-11  8:24               ` Dmitry A. Kazakov
  2017-05-15 22:53                 ` Randy Brukardt
  0 siblings, 1 reply; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-11  8:24 UTC (permalink / raw)


On 11/05/2017 02:35, Randy Brukardt wrote:

> Still, it would be nice to try something in this area. One could imagine
> creating an Annex C test that implemented a lock-free algorithm and just
> tried to see if it worked properly.

The real application used socket communication. The problem, I suspect, is 
that artificial tests would not show the problems a full-size application 
has. Maybe a better strategy would be specific low-level tests for 
atomic objects used in lock-free algorithms.

And we really should extend Annex C with some basic atomic 
operations like atomic increment, test-then-increment, 
test-then-exchange, etc., all falling back to protected objects if the 
machine lacks corresponding instructions.
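(For what it's worth, Ada 2022 later standardized almost exactly this wish list as System.Atomic_Operations and its children. A sketch of an atomic counter using that post-thread API, with hypothetical names of my own:)

```ada
--  Atomic increment via the Ada 2022 generic arithmetic package
--  (postdates this thread); Counters/Bump/Hits are invented names.
with System.Atomic_Operations.Integer_Arithmetic;

package Counters is
   type Atomic_Int is new Integer with Atomic;

   package Ops is
      new System.Atomic_Operations.Integer_Arithmetic (Atomic_Int);

   Hits : aliased Atomic_Int := 0;

   procedure Bump;
end Counters;

package body Counters is
   procedure Bump is
   begin
      --  Lock-free on machines with the instruction; the implementation
      --  must still provide the operation (atomically) either way.
      Ops.Atomic_Add (Hits, 1);
   end Bump;
end Counters;
```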

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-11  1:17         ` Randy Brukardt
@ 2017-05-11 16:23           ` Jeffrey R. Carter
  0 siblings, 0 replies; 65+ messages in thread
From: Jeffrey R. Carter @ 2017-05-11 16:23 UTC (permalink / raw)


On 05/11/2017 03:17 AM, Randy Brukardt wrote:
> "Shark8" <onewingedshark@gmail.com> wrote in message
> news:bca6700f-5cab-42a6-a055-c7747d0de6b1@googlegroups.com...
>> On Tuesday, May 9, 2017 at 1:49:29 PM UTC-6, Randy Brukardt wrote:
>>>
>>> Atomic implies sequential access for atomic AND volatile objects (note
>>> that
>>> volatile does NOT imply that). See 9.10 if you dare. (And if you
>>> understand
>>> 9.10, please join the ARG, we need you. ;-)
>>
>> I understood 9.10 -- how do I join the ARG?
>
> Talk to the chairperson, Jeff Cousins (ask me privately for his e-mail
> address; it's probably public information but I don't feel comfortable in
> posting anyone's address that way).

9.10 doesn't seem that bad. C.6, on the other hand, ...

-- 
Jeff Carter
It's better to be root than to reboot.
119


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-11  0:47                       ` Randy Brukardt
@ 2017-05-13 20:11                         ` Jere
  2017-05-15 22:33                           ` Randy Brukardt
  0 siblings, 1 reply; 65+ messages in thread
From: Jere @ 2017-05-13 20:11 UTC (permalink / raw)


On Wednesday, May 10, 2017 at 8:47:23 PM UTC-4, Randy Brukardt wrote:
> "Jere" wrote in message 
> > On Wednesday, May 10, 2017 at 3:13:27 AM UTC-4, Dmitry A. Kazakov wrote:
> >> On 10/05/2017 02:51, Jere wrote:
> ...
> > I may have misunderstood him, but from Randy's response, I gathered that
> > Volatile has no effect on sequence, only Atomic does.
> 
> There are two "sequences" here, one is what the compiler is allowed to 
> reorder (which is only non-volatile things, remember that atomic implies 
> volatile), and one is what the CPU is allowed to reorder (which is anything 
> other than atomic, the compiler has to do whatever is needed to avoid atomic 
> reordering).
> ...
> > In particular, I am basing it off of this response from Randy:
> > ***************************************************
> > (partial quote)
> > Atomic implies sequential access for atomic AND volatile objects (note 
> > that
> > volatile does NOT imply that). See 9.10 if you dare. (And if you 
> > understand
> > 9.10, please join the ARG, we need you. ;-)
> >
> > In particular, Atomic implies that any needed memory fences are used.
> > (Again, volatile alone does not make such a requirement.) The idea is that
> > one uses atomic objects to protect larger volatile data structures.
> > ***************************************************
> >
> > So even if Atomic does use fences, if the compiler decides to reorder the
> > statements before the binary is created, it may not help.
> >
> > Did I misunderstand his response?  I might have.
> 
> Yes. I didn't say anything about compiler reordering, because it's crystal 
> clear from C.6(20/5) and C.6(16/3) [and the earlier versions, too] that no 
> compiler reordering is allowed. [Indeed, C.6(20/5) was too strong, 
> preventing writing volatile bit-fields if taken literally (you have to do a 
> pre-read to do a bit-field write, but that would clearly be a "memory read 
> of a volatile object not specified by the program"), so it had to be 
> weakened.]
> 
> It's the effect on CPU reordering that isn't very clear; the AARM notes 
> C.6(16.a-d/3) are intended to clarify this, but not everyone reads the AARM.
> 
>                                          Randy.

Ok, that makes sense.  I reread 1.1.3 on external effects (referenced from 
C.6(20)), and I think that helps clarify a lot of this for me.  Thanks!

I played around with it a bit more.  I feel like the RM is very piecemeal in 
how it defines things: I have to find a bunch of different sections and piece 
together a full picture of how something is defined.  Not that it is wrong in 
any sense, but it definitely makes it challenging for someone who doesn't yet 
"know" the standard.  I'll get there in time.  Still, it's helpful that the RM 
is in HTML format with links, though some places lack links to definitions 
(again, things that make sense once you "know" the standard, but for someone 
new it can be a bit deep to wade through, especially if you are used to 
different terminology from other language standards).

Again, thanks!  Your responses helped clear it up.
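
[Editor's note: the ordering being discussed here can be written down 
concretely. Below is a hedged sketch (in C11 rather than Ada, with names 
invented for illustration) of the single-producer/single-consumer 
publication pattern from the start of the thread: the data slot is written 
first, then the index is stored with release semantics, matched by an 
acquire load on the consumer side. This is roughly the sequencing that 
Ada's Atomic is required to provide per C.6 and 9.10.]

```c
#include <stdatomic.h>
#include <assert.h>

/* Single-producer/single-consumer ring buffer publication:
 * write the data slot first, then publish the index with a
 * release store, so a consumer that loads the index with
 * acquire semantics is guaranteed to see the slot contents. */
#define SIZE 8

static int buffer[SIZE];
static _Atomic unsigned write_index = 0;

static void put(int item)
{
    /* relaxed load is fine: the producer is the only writer */
    unsigned w = atomic_load_explicit(&write_index, memory_order_relaxed);
    buffer[w % SIZE] = item;                     /* step 1: data first  */
    atomic_store_explicit(&write_index, w + 1,
                          memory_order_release); /* step 2: then index  */
}

/* Returns 1 and fills *item if something was available, else 0. */
static int try_get(unsigned *read_index, int *item)
{
    unsigned w = atomic_load_explicit(&write_index, memory_order_acquire);
    if (*read_index == w)
        return 0;                                /* buffer is empty */
    *item = buffer[*read_index % SIZE];
    (*read_index)++;
    return 1;
}
```

(No overflow check is shown; a real producer would also compare against 
the consumer's index before writing.)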


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-11  7:36                       ` Dmitry A. Kazakov
@ 2017-05-13 20:25                         ` Jere
  0 siblings, 0 replies; 65+ messages in thread
From: Jere @ 2017-05-13 20:25 UTC (permalink / raw)


On Thursday, May 11, 2017 at 3:36:09 AM UTC-4, Dmitry A. Kazakov wrote:
> On 11/05/2017 00:56, Jere wrote:
> > On Wednesday, May 10, 2017 at 1:26:00 AM UTC-4, J-P. Rosen wrote:
> >> Le 10/05/2017 à 02:51, Jere a écrit :
> >>> Is there a method besides Atomic?  If I am implementing a generic FIFO (lock
> >>> free) and the FIFO elements are complex data types that may not be able to be
> >>> atomic, do I have any other options or are protected objects my only way out?
> >>
> >> Protected objects, or for a lower level, Synchronous_Task_Control. I
> >> know that you want lock free algorithms, but out of curiosity, did you
> >> measure the cost of locking algorithms? IOW, is all of this worth the
> >> trouble?
> 
> > I wasn't necessarily aiming for a specific speed/cost.  I know I can easily
> > find a standard task safe container and use that and it might be faster (or at
> > least fast enough) on various platforms.  This is more for a different option
> > from the normal.  It might be faster on some platforms or slower, but it is
> > mostly meant to be a tool in toolkit.
>
> <SNIPPED>
> 
> So, Jere, when you become ARG (:-)), please, push for requirement to 
> support Atomic for all scalar objects. If the machine lacks 
> corresponding instructions the compiler must fall back to a protected 
> object. And conversely, if there is a strong case that using a protected 
> action might be more efficient than machine's atomic access 
> instructions, the compiler should use the former.
> 

Well, I believe Randy was offering that to Shark8, but it is definitely the 
type of thing I enjoy.  I'm very much into semantics when it comes to this 
type of stuff, though obviously I am nowhere near good enough in the Ada 
standard.  Nor am I experienced enough yet with big systems.  My experience is 
mainly in bare-metal applications, where Ada is not very predominant (though 
not absent).  I am much more comfortable in areas where there are no OSes and 
the CPUs are fairly simple microcontrollers (8- to 32-bit, single core).

I read through a lot of the posted ARG discussions/minutes, and I did find them 
pretty interesting.  I found myself disagreeing with some people and agreeing 
with others.  That said, I am very inexperienced and backseat ARG'ing is easy to 
do when I am in my position.  It's a difficult job I am sure!


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-13 20:11                         ` Jere
@ 2017-05-15 22:33                           ` Randy Brukardt
  0 siblings, 0 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-15 22:33 UTC (permalink / raw)


"Jere" <jhb.chat@gmail.com> wrote in message 
news:a0357a77-39dd-4e24-8580-a71a9ca79531@googlegroups.com...
...
> I played around with it a bit more.  I feel like the RM is very piece-meal 
> with
> how it defines things.  I have to find a bunch of different sections to 
> piece
> together a full picture on how something is defined.  Not that it is wrong 
> in
> any sense, but it definitely makes it challenging for someone who doesn't 
> yet
> "know" the standard.  I'll get there in time.  Still it's helpful that the 
> RM is
> in html format with links, though sometimes there are places lacking links 
> to
> definitions (again, to things that once you "know" the standard make 
> sense, but
> for someone new it can be a bit deep to wade through, especially if you 
> are used
> to different terminology from other language standards).

We intend that every defined term appear in the index, so I'd look there if 
all else fails. (Certainly before starting a search.) The real danger point 
is things (like "part") that are technical terms but don't seem like it.

BTW, Atomic/Volatile are messy as they're defined in a specialized needs 
annex, so part of the definition is in the core language and part in the 
annex. (It's hard to imagine an Ada compiler not implementing most of annex 
C, though - I don't think that Atomic/Volatile should have been there, but it 
would be messy to change.)

                                 Randy.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-11  1:51                     ` Robert Eachus
@ 2017-05-15 22:45                       ` Randy Brukardt
  0 siblings, 0 replies; 65+ messages in thread
From: Randy Brukardt @ 2017-05-15 22:45 UTC (permalink / raw)


"Robert Eachus" <rieachus@comcast.net> wrote in message 
news:fcf4a91a-bfdf-44ee-a888-7bb9a1c52b48@googlegroups.com...
On Wednesday, May 10, 2017 at 9:02:22 PM UTC-4, Randy Brukardt wrote:

>> Write_Index needs to be atomic in order for the OP's algorithm to work.
>> Atomic objects fall under C.6(20/5) -- all reads and updates are included 
>> in
>> the external effect of the program. Optimizing them out (in any way) is
>> wrong (excepting of course a true as-if optimization, but that doesn't 
>> make
>> any sense in this case). In particular, saving the value in a register is
>> useless, because you still have to read the object again at the next
>> reference.

>Even if there is no intervening read or write of anything?

Yes.

>It seems to me that must be an "as if," since there is no way to prove 
>otherwise.
>You may be thinking of reads of memory external to the CPU, part of some
>other hardware, where the other hardware can intervene.

I'm not "thinking" that, the language defines it that way (it's part of the 
external effect of the program, C.6(20) and 1.1.3(8-14)). I don't know how a 
compiler could possibly know if someone has a way to see the external 
effects (some CPU monitor, active memory, etc.), so one has to assume that 
they CAN tell. And in any case, the intent of the language designers is 
crystal-clear -- ignore that at your peril.

...
>Hmm. You need to read the (x86 and most other) rules for prefetching memory 
>locations.
>As I stated earlier in this thread, any CPU core is allowed to do a 
>prefetch of any location,
>as long as it does not cause an action like an interrupt.  This is 
>necessary, or you wouldn't
>be able to have any Atomic accesses. The CPU reads an entire cache line, 
>usually 64, 128,
>or 256 bytes into the cache, but even the widest of CPU writes are only 64 
>bytes wide, and
>those are multiple floating-point values.

I have the compiler generate the "correct" code and don't worry much about 
what the CPU does to it. (That's one reason I was completely in favor of 
removing the bogus requirement to read/write all the way to memory -- it 
doesn't make sense on CPUs that do caching.) That's out of my hands in any 
case as a software guy. (Refer to my response to Jere.) There's a 
requirement to prevent CPU reordering, but nothing else.

But I wouldn't be implementing the language if I started ignoring the 
requirements on the compiler by leaning on an as-if optimization, especially 
one that can be detected with active hardware. Especially as, if I did so, 
there would be no possibility of ever writing portable synchronization 
code...

                                Randy.





^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-11  8:24               ` Dmitry A. Kazakov
@ 2017-05-15 22:53                 ` Randy Brukardt
  2017-05-18 17:44                   ` Dmitry A. Kazakov
  0 siblings, 1 reply; 65+ messages in thread
From: Randy Brukardt @ 2017-05-15 22:53 UTC (permalink / raw)


"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:of174q$13o7$1@gioia.aioe.org...
> On 11/05/2017 02:35, Randy Brukardt wrote:
>
>> Still, it would be nice to try something in this area. One could imagine
>> creating an Annex C test that implemented a lock-free algorithm and just
>> tried to see if it worked properly.
>
> The real application used socket communication. The problem I suspect is 
> that artificial tests would not show problems full-size application have. 
> Maybe, a better strategy could be specific low-level tests for atomic 
> objects used in lock-free algorithms.

That's what I was thinking. But I'm not really the person to propose such 
tests, I know just enough to be dangerous. ;-) I'd be much better served 
being just an editor on such tests, as I wouldn't have too much trouble 
seeing things that need to be more portable (I have a lot of experience 
doing that).

> And we really should extend the Annex C with some basic atomic operations 
> like atomic increment, test-then-increment, test-then-exchange etc, all 
> falling back to protected objects if the machine lacks corresponding 
> instructions.

I think you are right about that. Implementations are supposed 
to provide access to such things via machine code insertions, but that is 
never going to be portable (even to other implementations for the same 
target). So that's not really helpful.

Perhaps you could send a problem statement and brief proposal to 
Ada-Comment? There's still time to make suggestions for Ada 2020, and 
tasking issues of all kinds are right in the wheelhouse of what we're hoping 
to accomplish. (As always, the ARG can find ways to solve problems; the real 
issue is knowing what the problems are. I believe that you are saying that 
creating portable lock-free algorithms is harder than necessary because of 
limitations in what can be atomic and what can be safely done with atomic 
objects. That seems like an important problem for the ARG to spend some time 
on.)

                                  Randy.





^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-15 22:53                 ` Randy Brukardt
@ 2017-05-18 17:44                   ` Dmitry A. Kazakov
  2017-05-18 21:01                     ` Randy Brukardt
  0 siblings, 1 reply; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-18 17:44 UTC (permalink / raw)


On 16/05/2017 00:53, Randy Brukardt wrote:
> "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message
> news:of174q$13o7$1@gioia.aioe.org...
>> On 11/05/2017 02:35, Randy Brukardt wrote:
>>
>>> Still, it would be nice to try something in this area. One could imagine
>>> creating an Annex C test that implemented a lock-free algorithm and just
>>> tried to see if it worked properly.
>>
>> The real application used socket communication. The problem I suspect is
>> that artificial tests would not show problems full-size application have.
>> Maybe, a better strategy could be specific low-level tests for atomic
>> objects used in lock-free algorithms.
> 
> That's what I was thinking. But I'm not really the person to propose such
> tests, I know just enough to be dangerous. ;-) I'd be much better served
> being just an editor on such tests, as I wouldn't have too much trouble
> seeing things that need to be more portable (I have a lot of experience
> doing that).
> 
>> And we really should extend the Annex C with some basic atomic operations
>> like atomic increment, test-then-increment, test-then-exchange etc, all
>> falling back to protected objects if the machine lacks corresponding
>> instructions.
> 
> I think you are right about that. Supposedly, implementations are supposed
> to provide access to such things via machine code insertions, but that is
> never going to be portable (even to other implementations for the same
> target). So that's not really helpful.
> 
> Perhaps you could send a problem statement and brief proposal to
> Ada-Comment? There's still time to make suggestions for Ada 2020, and
> tasking issues of all kinds are right in the wheelhouse of what we're hoping
> to accomplish. (As always, the ARG can find ways to solve problems, the real
> issue is knowing what the problems are. I believe that you are saying that
> creating portable lock-free algorithms is harder than necessary because of
> limitations in what can be atomic and what can be safely done with atomic
> objects. That seems like an important problem for the ARG to spend some time
> on.)

I was thinking about something slightly higher-level than tedious CAS, 
e.g. a predefined generic package:

generic
    type Value_Type is (<>);
    Initial_Value : Value_Type'Base;
package Atomic_Object is
--
-- Holder_Type -- Type of the atomic object to keep Value_Type'Base
--
    type Holder_Type is private;
    pragma Atomic (Holder_Type); -- It is a private type,
                                 -- so it must be safe to assign
--
-- Get actual value
--
    function Load (Item : Holder_Type) return Value_Type'Base;
--
-- Set new value
--
    procedure Store (Item : in out Holder_Type;
                     New_Value : Value_Type'Base);
    procedure Store (Item : in out Holder_Type;
                     New_Value : Value_Type'Base;
                     Old_Value : out Value_Type'Base);
--
-- Compare old value to the barrier and if successful
-- store new value
--
    procedure Store_If_Greater (Item : in out Holder_Type;
                                Barrier : Value_Type'Base;
                                New_Value : Value_Type'Base);
    procedure Store_If_Greater (Item : in out Holder_Type;
                                Barrier : Value_Type'Base;
                                New_Value : Value_Type'Base;
                                Old_Value : out Value_Type'Base);
    procedure Store_If_Equal ...
    procedure Store_If_Inequal ...
    procedure Store_If_Less ...
--
-- Increment value
--
    procedure Increment (Item : in out Holder_Type;
                         By   : Value_Type'Base);
    procedure Increment (Item : in out Holder_Type;
                         By        : Value_Type'Base;
                         Old_Value : out Value_Type'Base);
--
-- Compare old value to the barrier and if successful increment
--
    procedure Increment_If_Greater (Item : in out Holder_Type;
                                    Barrier : Value_Type'Base;
                                    Increment : Value_Type'Base);
    procedure Increment_If_Greater (Item : in out Holder_Type;
                                    Barrier : Value_Type'Base;
                                    Increment : Value_Type'Base;
                                    Old_Value : out Value_Type'Base);
    procedure Increment_If_Equal ...
    procedure Increment_If_Inequal ...
    procedure Increment_If_Less ...

end Atomic_Object;

The implementation should be free to use machine instructions, a 
spin-lock, or a protected object.

Example: reference counted object:

type Count_Type is mod 2**32;
package Atomic_Counters is new Atomic_Object (Count_Type, 0);

type Reference_Counted is record
    Use_Count : Atomic_Counters.Holder_Type;
    ...
end record;

type Reference_Counted_Ptr is access all Reference_Counted;
procedure Free is new Ada.Unchecked_Deallocation (Reference_Counted,
                                                 Reference_Counted_Ptr);
type Handle is new Ada.Finalization.Controlled with record
    Target : Reference_Counted_Ptr;
end record;

procedure Adjust (Pointer : in out Handle) is
begin -- Could check overflows using Increment_If_Less
    Atomic_Counters.Increment (Pointer.Target.Use_Count, 1);
end Adjust;

procedure Finalize (Pointer : in out Handle) is
    Old_Value : Count_Type;
begin
    if Pointer.Target /= null then
       Atomic_Counters.Increment_If_Greater
       (  Item      => Pointer.Target.Use_Count,
          Barrier   => 0,
          Increment => 0 - 1, -- Decrement by 1
          Old_Value => Old_Value
       );
       case Old_Value is
          when 0 => -- Use count is unexpectedly 0
             raise Program_Error;
          when 1 => -- Going to destroy the target
             Free (Pointer.Target);
          when others =>
             null;
       end case;
    end if;
end Finalize;
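
[Editor's note: for comparison, a hedged C11 sketch of the same 
reference-counting logic (names invented; not part of the proposal above). 
The point it illustrates is the same one Finalize relies on: the decrement 
and the test must be a single fetch-and-subtract, so that exactly one 
releaser observes the count drop to zero and frees the object.]

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <assert.h>

/* A reference-counted object; use_count plays the role of
 * Use_Count in the Ada example. */
struct counted {
    _Atomic unsigned use_count;
    /* ... payload ... */
};

static void retain(struct counted *p)
{
    /* corresponds to Adjust: unconditional atomic increment */
    atomic_fetch_add_explicit(&p->use_count, 1, memory_order_relaxed);
}

/* Corresponds to Finalize.  Returns 1 if this call freed the object.
 * fetch_sub returns the value BEFORE the decrement, so exactly one
 * caller sees 1 and destroys the target. */
static int release(struct counted *p)
{
    unsigned old = atomic_fetch_sub_explicit(&p->use_count, 1,
                                             memory_order_acq_rel);
    assert(old != 0);          /* use count unexpectedly 0 */
    if (old == 1) {
        free(p);
        return 1;
    }
    return 0;
}
```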

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-18 17:44                   ` Dmitry A. Kazakov
@ 2017-05-18 21:01                     ` Randy Brukardt
  2017-05-19  7:54                       ` Dmitry A. Kazakov
  0 siblings, 1 reply; 65+ messages in thread
From: Randy Brukardt @ 2017-05-18 21:01 UTC (permalink / raw)


As I always repeat, the ARG is looking more for problems than for solutions 
(we're plenty inventive in that regard). The most useful thing is a problem 
statement on the lines of "I can't write a portable <something> algorithm 
because Ada doesn't provide any portable <blah> operations". Along with an 
explanation of the <something> algorithm.

It's fairly obvious to me that, since Atomic only protects a read or a 
write but not both, classic operations like test-and-set can't be (directly) 
written using it. Some algorithms can be constructed without test-and-set, 
but plenty of others can't be.
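
[Editor's note: a minimal sketch of that point, in C11 for lack of a 
portable Ada equivalent, which is exactly the complaint. A test-and-set 
lock needs one indivisible read-modify-write; no combination of separate 
atomic reads and atomic writes can express it.]

```c
#include <stdatomic.h>
#include <assert.h>

/* Test-and-set spin lock: atomic_flag_test_and_set is a single
 * indivisible read-modify-write.  An atomic read followed by an
 * atomic write would leave a window where two tasks both see the
 * flag clear and both enter the critical section. */
static atomic_flag lock = ATOMIC_FLAG_INIT;
static int shared_counter = 0;

static void locked_increment(void)
{
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;  /* spin until the flag was previously clear */
    shared_counter++;  /* critical section */
    atomic_flag_clear_explicit(&lock, memory_order_release);
}
```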

I'd think that the feature in question needs to be as low-level as possible 
so that it can be implemented as efficiently as possible (if a protected 
object is needed to implement it, there is no value in the low-level feature, 
as anyone can -- and should -- use a PO if that is possible).

I'd have no idea how to implement your generic without resorting to a PO, 
which would defeat the purpose. (We recently had this discussion wrt 
Suspension_Objects -- they can always be implemented with a PO, but the 
entire point was that they ought to be simpler than that. Thus we didn't 
make any additional requirements on them and in fact clarified that only one 
task is supposed to be calling Suspend_Until_True. The same sort of dynamic 
seems to be in effect here.)

Anyway, as I said, I know just enough to be dangerous here. You clearly know 
more, so send some comments to Ada-Comment to get a discussion started. Else 
it will never make the agenda... and then it's pretty certain that no changes 
will happen!

                                Randy.

"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> wrote in message 
news:ofkmin$1iog$1@gioia.aioe.org...
<SNIPPED full quote of the previous message>


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Portable memory barrier?
  2017-05-18 21:01                     ` Randy Brukardt
@ 2017-05-19  7:54                       ` Dmitry A. Kazakov
  0 siblings, 0 replies; 65+ messages in thread
From: Dmitry A. Kazakov @ 2017-05-19  7:54 UTC (permalink / raw)


On 18/05/2017 23:01, Randy Brukardt wrote:
> As I always repeat, the ARG is looking more for problems than for solutions
> (we're plenty inventive in that regard). The most useful thing is a problem
> statement on the lines of "I can't write a portable <something> algorithm
> because Ada doesn't provide any portable <blah> operations". Along with an
> explanation of the <something> algorithm.
> 
> It's fairly obvious to me that since Atomic only protects a read or a write,
> but not both that classic operations like test-and-set can't be (directly)
> written using it. Some algorithms can be constructed without test-and-set,
> but plenty of others can't be.

AFAIK, only a FIFO can rely solely on Load and Store. Almost everything 
else requires a test and/or the old value in some form.

> I'd think that the feature in question needs to be a low-level as possible
> so that it can be implemented as efficiently as possible (if a protected
> object is needed to implement it, there is no value to the low-level feature
> as anyone can -- and should -- use a PO if that is possible).

There is no value if it is too low-level. You don't get portability 
anyway, in the sense that there is no guarantee that each target has 
an instruction for it. Not even CAS (Store_If_Equal in the 
generic package outline).

Note that some higher-level operations are implementable with CAS. E.g. 
atomic increment.
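
[Editor's note: a hedged sketch of that claim in C11 terms, helper name 
invented. Atomic increment can be built from nothing but a CAS retry loop; 
real implementations would usually prefer a native fetch-and-add where 
one exists.]

```c
#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>

/* Atomic increment built purely from compare-and-swap: read the
 * current value, attempt to swap in value+by, and retry if another
 * task got in between.  On failure, atomic_compare_exchange_weak
 * refreshes 'old' with the value actually found. */
static uint32_t cas_increment(_Atomic uint32_t *counter, uint32_t by)
{
    uint32_t old = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &old, old + by))
        ;  /* interference (or spurious failure): retry with fresh 'old' */
    return old;  /* value before the increment, like Old_Value above */
}
```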

> I'd have no idea how to implement your generic without resorting to a PO,
> which would defeat the purpose.

Not at all. The programmer has no idea which would be the most efficient 
way to implement atomic increment on a given machine:

1. CAS in a loop
2. an OS call (e.g. under Windows)
3. a built-in operation (e.g. in GCC)
4. a protected object

Even if he knows, the outcome is not portable, except for the PO, which 
you already declared a non-starter. The compiler vendor can make the 
right choice, and the programmer simply does not care what goes on 
behind the scenes.

> (We recently had this discussion wrt
> Suspension_Objects -- they can always be implemented with a PO, but the
> entire point was that they ought to be simpler than that. Thus we didn't
> make any additional requirements on them and in fact clarified that only one
> task is supposed to be calling Suspend_Until_True. The same sort of dynamic
> seems to be in effect here.)

[OT

IMO raw suspension objects have no use whatsoever. What is really 
required is an entry interface for tagged types, so that you could use 
timed entry calls in interfaces rather than passing a timeout 
parameter around and handling exceptions later:

    select
       Driver.Queue (IO_Request);
    or delay 0.001;
       -- The device is busy
    end select;
/OT]

> Anyway, as I said, I know just enough to be dangerous here. You clearly know
> more, so send some comments to Ada-Comment to get a discussion started. Else
> it will never make the agenda...and then its pretty certain that no changes
> will happen!

I hoped to get some input from c.l.a. Maybe nobody cares, or maybe some 
have other, better ideas, use cases, etc.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de



end of thread, other threads:[~2017-05-19  7:54 UTC | newest]

Thread overview: 65+ messages
2017-05-06  2:23 Portable memory barrier? Jere
2017-05-06  8:47 ` Jeffrey R. Carter
2017-05-06 14:17   ` Jere
2017-05-06 19:08     ` Dennis Lee Bieber
2017-05-06 19:26     ` Jeffrey R. Carter
2017-05-06 19:41     ` Jeffrey R. Carter
2017-05-06 20:42       ` Niklas Holsti
2017-05-09 19:49     ` Randy Brukardt
2017-05-09 22:07       ` Jere
2017-05-11  1:14         ` Randy Brukardt
2017-05-10 18:28       ` Shark8
2017-05-11  1:17         ` Randy Brukardt
2017-05-11 16:23           ` Jeffrey R. Carter
2017-05-07 20:18 ` Robert Eachus
2017-05-08  7:45   ` Dmitry A. Kazakov
2017-05-08 15:56     ` Robert Eachus
2017-05-08 16:22       ` Dmitry A. Kazakov
2017-05-08 18:39         ` Robert Eachus
2017-05-08 19:18         ` Robert Eachus
2017-05-08 21:09           ` Dmitry A. Kazakov
2017-05-08 23:24             ` Robert Eachus
2017-05-09  0:30               ` Jere
2017-05-09  4:02                 ` Robert Eachus
2017-05-09  4:32                 ` Robert Eachus
2017-05-09  4:44                   ` Robert Eachus
2017-05-09 22:26                   ` Jere
2017-05-09 20:01                 ` Randy Brukardt
2017-05-09 19:57               ` Randy Brukardt
2017-05-10  0:51                 ` Jere
2017-05-10  5:25                   ` J-P. Rosen
2017-05-10 22:56                     ` Jere
2017-05-11  7:36                       ` Dmitry A. Kazakov
2017-05-13 20:25                         ` Jere
2017-05-10  7:13                   ` Dmitry A. Kazakov
2017-05-10 16:45                     ` Robert Eachus
2017-05-10 17:28                       ` Simon Wright
2017-05-10 23:21                     ` Jere
2017-05-11  0:47                       ` Randy Brukardt
2017-05-13 20:11                         ` Jere
2017-05-15 22:33                           ` Randy Brukardt
2017-05-10 23:30                     ` Jere
2017-05-11  0:38                     ` Randy Brukardt
2017-05-10 16:38                   ` Jeffrey R. Carter
2017-05-10 23:40                     ` Jere
2017-05-10 16:19                 ` Robert Eachus
2017-05-11  1:02                   ` Randy Brukardt
2017-05-11  1:51                     ` Robert Eachus
2017-05-15 22:45                       ` Randy Brukardt
2017-05-08 20:29         ` Niklas Holsti
2017-05-08 21:09           ` Dmitry A. Kazakov
2017-05-09  4:34             ` Niklas Holsti
2017-05-09  6:16               ` Niklas Holsti
2017-05-09  8:34                 ` Dmitry A. Kazakov
2017-05-09 20:18                 ` Randy Brukardt
2017-05-09 20:10           ` Randy Brukardt
2017-05-09  0:05         ` Jere
2017-05-09  8:26           ` Dmitry A. Kazakov
2017-05-09 19:53         ` Randy Brukardt
2017-05-09 20:27           ` Dmitry A. Kazakov
2017-05-11  0:35             ` Randy Brukardt
2017-05-11  8:24               ` Dmitry A. Kazakov
2017-05-15 22:53                 ` Randy Brukardt
2017-05-18 17:44                   ` Dmitry A. Kazakov
2017-05-18 21:01                     ` Randy Brukardt
2017-05-19  7:54                       ` Dmitry A. Kazakov
