Does StampedLock need a releaseFence in theory?

Does StampedLock need a releaseFence in theory?

Martin Buchholz-3
Somehow I ended up re-reading Hans' excellent
Can Seqlocks Get Along With Programming Language Memory Models?
and was wondering why there wasn't a symmetrical releaseFence to pair up with the acquireFence.  The proof in section 5 says

"""
The correctness argument is basically as above, but the
happens-before chain becomes:
Initial update of seq by w is sequenced before
Write to datan by w synchronizes with
The acquire fence in r (since preceding operation saw
write, 29.8 p4 in [11]) is sequenced before
final read of seq by r
"""

But if the write to datan is a relaxed write (as would be written by programmers accustomed to using plain writes in a critical section), then I don't see the "Write to datan by w synchronizes with ..."  and I wonder whether for theoretical correctness one needs something like:

--- src/main/java/util/concurrent/locks/StampedLock.java 9 Jun 2016 00:32:02 -0000 1.60
+++ src/main/java/util/concurrent/locks/StampedLock.java 16 Jun 2016 01:30:09 -0000
@@ -349,9 +347,14 @@
     public long tryWriteLock() {
         long s, next;
-        return ((((s = state) & ABITS) == 0L &&
-                 STATE.compareAndSet(this, s, next = s + WBIT)) ?
-                next : 0L);
+        if (((s = state) & ABITS) == 0L &&
+            STATE.compareAndSet(this, s, next = s + WBIT)) {
+            // Necessary in theory, but not in practice?
+            VarHandle.releaseFence();
+            return next;
+        } else {
+            return 0L;
+        }
     }
 
In practice it's probably not an issue because CASes are implemented using full fences, and the JIT cannot reorder the dependent write.

BTW, with jdk9 VarHandles seqlocks can be implemented without Unsafe, so a "new release" of the Seqlocks paper might be useful.
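
To make the question concrete, here is a minimal JDK 9 VarHandle seqlock sketch (class and field names are mine, not the paper's or StampedLock's); the releaseFence after the acquiring CAS marks the spot whose theoretical necessity is being asked about:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Minimal seqlock sketch using JDK 9 VarHandles. Names are hypothetical.
// The releaseFence after the acquiring CAS is the fence in question:
// without it, the plain data stores below might in theory move before
// the CAS on the sequence word.
class SeqLockSketch {
    private volatile int seq;    // even: no writer; odd: write in progress
    private int data1, data2;    // plain fields protected by seq

    void write(int a, int b) {
        int s;
        do {
            s = seq;
        } while ((s & 1) != 0 || !SEQ.compareAndSet(this, s, s + 1));
        VarHandle.releaseFence();   // necessary in theory, but not in practice?
        data1 = a;
        data2 = b;
        seq = s + 2;                // volatile store: releases the data stores
    }

    int[] read() {
        for (;;) {
            int s = seq;                       // volatile (acquire) read
            if ((s & 1) != 0) continue;        // writer active; retry
            int a = data1, b = data2;          // plain reads
            VarHandle.acquireFence();          // the paper's reader-side fence
            if (seq == s) return new int[] { a, b };
        }
    }

    private static final VarHandle SEQ;
    static {
        try {
            SEQ = MethodHandles.lookup()
                    .findVarHandle(SeqLockSketch.class, "seq", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
}
```

The reader validates by re-reading seq after the acquire fence; the writer makes seq odd via CAS, performs plain stores, and publishes with the volatile store of s + 2.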

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: Does StampedLock need a releaseFence in theory?

Hans Boehm
[Sorry about the delay]

That paper clearly paid too little attention to the writer side of the seqlock implementation.  It has an outright (boring, unrelated) bug, which was minimally fixed in the slides at http://safari.ece.cmu.edu/MSPC2012/slides_posters/boehm-slides.pdf . (This bug convinced me that the C++ compare_exchange semantics that update the expected value may not have been the best thing since sliced bread.)

The writer code also assumes that data updates use plain C++ assignments to atomic variables, which are sequentially consistent. Thus it dodges all further memory ordering issues in the writer.  At some cost. But the assumption throughout is that the writer is not performance critical.

If the data updates are allowed to use relaxed operations, then Martin's concern is entirely valid. I believe it is valid in practice, not just in theory.  The easiest way to convince yourself is to consider the case in which compareAndSet is in fact implemented with a lock.  Using "roach motel semantics", the data stores are allowed to move into the compareAndSet critical section, and become visible before the compareAndSet update itself. Oops.
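
The lock-based thought experiment can be sketched directly (hypothetical class, in Java for consistency with the StampedLock discussion). Roach-motel rules let accesses migrate into a critical section but not out of it, so a plain store placed after a successful call may become visible before the update to value:

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical CAS emulated with a lock. Under roach-motel semantics,
// accesses surrounding a lock()/unlock() pair may migrate INTO the
// critical section but not out of it. So a plain data store that
// follows a successful compareAndSet call may legally become visible
// before the update to `value` does -- the reordering described above.
class LockBasedCas {
    private final ReentrantLock lock = new ReentrantLock();
    private long value;

    LockBasedCas(long initial) { value = initial; }

    long get() {
        lock.lock();
        try { return value; } finally { lock.unlock(); }
    }

    boolean compareAndSet(long expected, long update) {
        lock.lock();       // acquire: later accesses may move up past this
        try {
            if (value != expected) return false;
            value = update;
            return true;
        } finally {
            lock.unlock(); // release: earlier accesses may not move down past this
        }
    }
}
```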

An ARMv8 compareAndSet operation (using only acquire and release operations, not dmb, as it should be implemented) will behave like the lock-based one in this respect.  I think the current code above is incorrect on ARMv8 (barring compensating pessimizations elsewhere).


Re: Does StampedLock need a releaseFence in theory?

Andrew Haley
On 14/07/16 01:53, Hans Boehm wrote:
> An ARMv8 compareAndSet operation (using only acquire and release
> operations, not dmb, as it should be implemented) will behave like the
> lock-based one in this respect.  I think the current code above is
> incorrect on ARMv8 (barring compensating pessimizations elsewhere).

Umm, what?  The ARMv8 compareAndSet has a sequentially consistent store.
I guess I must be missing something important.

Andrew.


Re: Does StampedLock need a releaseFence in theory?

Martin Buchholz-3


On Thu, Jul 14, 2016 at 1:23 AM, Andrew Haley <[hidden email]> wrote:
On 14/07/16 01:53, Hans Boehm wrote:
> An ARMv8 compareAndSet operation (using only acquire and release
> operations, not dmb, as it should be implemented) will behave like the
> lock-based one in this respect.  I think the current code above is
> incorrect on ARMv8 (barring compensating pessimizations elsewhere).

Umm, what?  The ARMv8 compareAndSet has a sequentially consistent store.
I guess I must be missing something important.

(Pretending to be Hans here ...)
The idea is that all ARMv8 "load-acquire/store-release" operations (including those used for implementing CAS) are sequentially consistent when considered as a group in the same way that all "synchronization actions" in Java are, but they can still be reordered with plain reads/writes, just like Java plain variable access can be reordered with volatile variable access (unless a happens-before relationship exists).

The section in
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
on aarch64 should be useful.

ldar corresponds to Java volatile read
stlr corresponds to Java volatile write
CAS is implemented using a ldaxr followed by stlxr which is efficient, but allows subsequent writes to move in between the ldaxr and the stlxr.

(Back to being Martin ...)
Reordering a plain store from after to before a stlxr (rather than non-exclusive stlr) is still rather surprising because it looks like a speculative store - we don't know yet whether the stlxr will succeed.  Unlike the case where we implement CAS via a lock.  Am I thinking too atomically?  Perhaps the stlxr instruction implementation exclusively acquires the cache line, sees that it surely will succeed, but will be slow because pending memory operations must be completed first.  But no reason we can't start executing future instructions in the meantime!?


Re: Does StampedLock need a releaseFence in theory?

Andrew Haley
On 14/07/16 16:27, Martin Buchholz wrote:

> On Thu, Jul 14, 2016 at 1:23 AM, Andrew Haley <[hidden email]> wrote:
>
>> On 14/07/16 01:53, Hans Boehm wrote:
>>> An ARMv8 compareAndSet operation (using only acquire and release
>>> operations, not dmb, as it should be implemented) will behave like the
>>> lock-based one in this respect.  I think the current code above is
>>> incorrect on ARMv8 (barring compensating pessimizations elsewhere).
>>
>> Umm, what?  The ARMv8 compareAndSet has a sequentially consistent store.
>> I guess I must be missing something important.
>
>
> (Pretending to be Hans here ...)
> The idea is that all ARMv8 "load-acquire/store-release" operations
> (including those used for implementing CAS) are sequentially consistent
> when considered as a group in the same way that all "synchronization
> actions" in Java are, but they can still be reordered with plain
> reads/writes, just like Java plain variable access can be reordered with
> volatile variable access (unless a happens-before relationship exists).
>
> The section in
> https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
> on aarch64 should be useful.

I get that.  I'm just trying to understand exactly which scenario Hans
is worried about.

void writer(...) {
    unsigned seq0 = seq;
    while (seq0 & 1 ||
           !seq.cmp_exc_wk
           (seq0,seq0+1))
        { seq0 = seq; }
    data1 = ...;
    data2 = ...;
    seq = seq0 + 2;
}

> CAS is implemented using a ldaxr followed by stlxr which is efficient, but
> allows subsequent writes to move in between the ldaxr and the stlxr.

OK, got that.  A write of data1 might move before the seq.cmp_exc_wk
has succeeded in bumping seq0 (the version clock).

Reading the AArch64 specification as carefully as I can, I see

-------------------------------------------------------------------
For a Store-Release, observers in the shareability domain of the
address accessed by the Store-Release observe:

1.  Both of the following, if the shareability of the addresses
accessed requires that the observer observes them:

        Reads and writes caused by loads and stores appearing in
        program order before the Store-Release.

        Writes that have been observed by the PE executing the
        Store-Release before executing the Store-Release.

2.  The write caused by the Store-Release.

There are no additional ordering requirements on loads or stores that
appear in program order after the Store-Release.
-------------------------------------------------------------------

So, yes, it is quite possible for a write of data1 to move before the
write of the clock.  And I can see why that would be bad.  We really
need something like

void writer(...) {
    unsigned seq0 = seq;
    while (seq0 & 1 ||
           !seq.cmp_exc_wk
           (seq0,seq0+1))
        { seq0 = seq; }
    atomic_thread_fence(m_o_store_store);
    data1 = ...;
    data2 = ...;
    seq = seq0 + 2;
}

> (Back to being Martin ...)
> Reordering a plain store from after to before a stlxr (rather than
> non-exclusive stlr) is still rather surprising because it looks like
> a speculative store - we don't know yet whether the stlxr will
> succeed.  Unlike the case where we implement CAS via a lock.

But even CAS via a lock has the roach motel property, so we're used to
the idea of stores and loads moving into a critical section: none of
this should surprise people.

> Am I thinking too atomically?  Perhaps the stlxr instruction
> implementation exclusively acquires the cache line, sees that it
> surely will succeed, but will be slow because pending memory
> operations must be completed first.

I don't think it has to be so complicated.  An out-of-order
implementation can move stores which are after this stlxr to before it
simply because there is no logic preventing it from doing so.  Whether
this is a realistic or useful property of a real machine is a whole
'nother matter.

Andrew.

Re: Does StampedLock need a releaseFence in theory?

Hans Boehm
On Thu, Jul 14, 2016 at 10:16 AM, Andrew Haley <[hidden email]> wrote:

> (Back to being Martin ...)
> Reordering a plain store from after to before a stlxr (rather than
> non-exclusive stlr) is still rather surprising because it looks like
> a speculative store - we don't know yet whether the stlxr will
> succeed.  Unlike the case where we implement CAS via a lock.

But even CAS via a lock has the roach motel property, so we're used to
the idea of stores and loads moving into a critical section: none of
this should surprise people.

> Am I thinking too atomically?  Perhaps the stlxr instruction
> implementation exclusively acquires the cache line, sees that it
> surely will succeed, but will be slow because pending memory
> operations must be completed first.

I don't think it has to be so complicated.  An out-of-order
implementation can move stores which are after this stlxr to before it
simply because there is no logic preventing it from doing so.  Whether
this is a realistic or useful property of a real machine is a whole
'nother matter.

Andrew.
I think Martin does have a point here that I missed the first time.  The data stores are "control-dependent" on the store in the CAS implementation.  ARM and Power effectively preserve ordering between control-dependent stores, though not a store followed by a load. But we really only care about stores here.

At least that's probably true; I'm not 100% sure whether stlxr success is viewed as dependent on the store itself for purposes of determining ordering. (Will Deacon might be able to answer?)

There is also another subtlety here in the definition of "control-dependency" in that arguably the control paths have rejoined. My understanding is that ARM's definition of "control-dependency" is not affected by such rejoining, but Itanium's is.

A spin-lock-based implementation would indeed still have the issue, since the lock release there is a plain store.


Re: Does StampedLock need a releaseFence in theory?

Andrew Haley
On 14/07/16 20:19, Hans Boehm wrote:

> I think Martin does have a point here that I missed the first time.
> The data stores are "control-dependent" on the store in the CAS
> implementation.  ARM and Power effectively preserve ordering between
> control-dependent stores, though not a store followed by a load. But
> we really only care about stores here.
>
> At least that's probably true; I'm not 100% sure whether stlxr
> success is viewed as dependent on the store itself for purposes of
> determining ordering. (Will Deacon might be able to answer?)

I don't think it matters.  The code will always look like

    stlxr status, data, [addr]
    cbnz status, retry
    str r1, [data1]
    str r2, [data2]

The stores are control-dependent on the CBNZ: they cannot be
speculated before the STLXR.

Andrew.


Re: Does StampedLock need a releaseFence in theory?

David Holmes-6
In reply to this post by Martin Buchholz-3

Regarding:

> CAS is implemented using a ldaxr followed by stlxr which is efficient, but allows subsequent writes to move in between the ldaxr and the stlxr.

"we" don't think it is the case. Unfortunately the spec does not clearly state this, but informally:

- Any such writes would potentially invalidate the reservation, so moving writes into there seems a bad idea unless you expend effort determining it won't invalidate the reservation

- Any such writes would be speculative and potentially need undoing, so this also seems like far too much effort for little if any gain

- If this were truly an issue then the "Barrier Litmus Tests" appendix in the ARMv8 Architecture manual would flag it and show the use of explicit memory barriers

It would be good if the ARM folk could clarify this, and if necessary get the "Barrier Litmus Tests" text updated.

Also note that the C++11 Cmpxchg-SeqCst mapping for AArch64 does not add any additional explicit barriers:

https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

David


Re: Does StampedLock need a releaseFence in theory?

Andrew Haley
In reply to this post by Andrew Haley
On 15/07/16 09:47, Will Deacon wrote:

> On Fri, Jul 15, 2016 at 08:52:00AM +0100, Andrew Haley wrote:
>>
>> I don't think it matters.  The code will always look like
>>
>>     stlxr status, data, [addr]
>>     cbnz status, retry
>>     str r1, [data1]
>>     str r2, [data2]
>>
>> The stores are control-dependent on the CBNZ: they cannot be
>> speculated before the STLXR.
>
> I'm not so sure about that, unfortunately. You can conceive of a
> microarchitecture that knows stlxr can never fail, in which case the
> control dependency can be folded away. If you want ordering between the
> stlxr and the subsequent stores, you need an explicit barrier.

That's interesting.  I'm surprised that such an optimization can change
the ordering rules, though.

Andrew.


Re: Does StampedLock need a releaseFence in theory?

David Holmes-6
Andrew Haley writes:

> Sent: Friday, July 15, 2016 9:51 PM
> To: Will Deacon <[hidden email]>
> Cc: Martin Buchholz <[hidden email]>; Doug Lea
> <[hidden email]>; concurrency-interest <concurrency-
> [hidden email]>
> Subject: Re: [concurrency-interest] Does StampedLock need a releaseFence
> in theory?
>
> On 15/07/16 09:47, Will Deacon wrote:
> > On Fri, Jul 15, 2016 at 08:52:00AM +0100, Andrew Haley wrote:
> >>
> >> I don't think it matters.  The code will always look like
> >>
> >>     stlxr status, data, [addr]
> >>     cbnz status, retry
> >>     str r1, [data1]
> >>     str r2, [data2]
> >>
> >> The stores are control-dependent on the CBNZ: they cannot be
> >> speculated before the STLXR.
> >
> > I'm not so sure about that, unfortunately. You can conceive of a
> > microarchitecture that knows stlxr can never fail, in which case the
> > control dependency can be folded away. If you want ordering between the
> > stlxr and the subsequent stores, you need an explicit barrier.
>
> That's interesting.  I'm surprised that such an optimization can change
> the ordering rules, though.

Is this "conceive" in a thought-experiment sense or something realistic? Is this just an unintended hole in the abstract architectural model or a deliberate allowance for such "optimizations"?

David
 
> Andrew.
>

Re: Does StampedLock need a releaseFence in theory?

Andrew Haley
In reply to this post by Andrew Haley
On 15/07/16 12:50, Andrew Haley wrote:

> On 15/07/16 09:47, Will Deacon wrote:
>> On Fri, Jul 15, 2016 at 08:52:00AM +0100, Andrew Haley wrote:
>>>
>>> I don't think it matters.  The code will always look like
>>>
>>>     stlxr status, data, [addr]
>>>     cbnz status, retry
>>>     str r1, [data1]
>>>     str r2, [data2]
>>>
>>> The stores are control-dependent on the CBNZ: they cannot be
>>> speculated before the STLXR.
>>
>> I'm not so sure about that, unfortunately. You can conceive of a
>> microarchitecture that knows stlxr can never fail, in which case the
>> control dependency can be folded away. If you want ordering between the
>> stlxr and the subsequent stores, you need an explicit barrier.
>
> That's interesting.  I'm surprised that such an optimization can change
> the ordering rules, though.

Thinking about it some more, I am very skeptical.  For example, it is a
common trick to enforce ordering by using an address dependency, such as:

    ldr x1, [addr1]
    eor x0, x0, x1
    eor x0, x0, x1
    ldr x2, [x0]

Here, the load of x2 must be after the load of x1 because of the
address dependency.  A "smart" microarchitecture could fold away the
EORs (and maybe even the load of x1).  But I'm sure that would not be
permitted by the specification.

Andrew.



Re: Does StampedLock need a releaseFence in theory?

Doug Lea
In reply to this post by Andrew Haley

On 07/15/2016 10:27 AM, Will Deacon wrote:

> a fence would be required to maintain order with a subsequent store.

Thanks.

I now agree with Martin that we should add VarHandle.storeStoreFence
to write-unlock.

Backing up: We've long known that versioned locks (seqlocks, StampedLock)
should not in general allow "roach-motel" reorderings. We couldn't
introduce StampedLock until we had fence intrinsics for the load-load
case. But no one considered that the implementation of write-unlock
via CAS might permit hardware write reorderings. Thanks to the ARM
folks for finding the tiny bit of wiggle room in CAS specs that
might in theory (if not in practice) require the complementary
fence on unlock. Too bad this will slow down some implementations
for no good reason, but probably not noticeably. I wonder if/how
this impacts other versioned lock implementations in other
languages.

-Doug
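
A hedged sketch of where the fences might land, using a toy lock with constants modeled loosely on StampedLock's (this is not the actual java.util.concurrent source): Martin's patch placed a fence after the acquiring CAS in tryWriteLock, while the storeStoreFence discussed here guards write-unlock.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Toy model, NOT the j.u.c source. WBIT is the write-lock bit and ABITS
// the "lock is held" mask, loosely modeled on StampedLock's constants.
class MiniStampedLock {
    private static final long WBIT   = 1L << 7;
    private static final long ABITS  = (WBIT - 1L) | WBIT;
    private static final long ORIGIN = WBIT << 1;

    private volatile long state = ORIGIN;

    public long tryWriteLock() {
        long s, next;
        if (((s = state) & ABITS) == 0L
                && STATE.compareAndSet(this, s, next = s + WBIT)) {
            // Martin's proposed fence: keep the new owner's plain data
            // stores from floating above the CAS that set WBIT.
            VarHandle.releaseFence();
            return next;
        }
        return 0L;
    }

    public void unlockWrite(long stamp) {
        if (state != stamp || (stamp & WBIT) == 0L)
            throw new IllegalMonitorStateException();
        // The storeStoreFence for write-unlock: order the critical
        // section's plain stores before the version bump.
        VarHandle.storeStoreFence();
        state = (stamp += WBIT) == 0L ? ORIGIN : stamp;
    }

    private static final VarHandle STATE;
    static {
        try {
            STATE = MethodHandles.lookup()
                    .findVarHandle(MiniStampedLock.class, "state", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
}
```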



Re: Does StampedLock need a releaseFence in theory?

Hans Boehm
My impression is that Linux kernel seqlock use cases don't care about write performance.  Presumably you have some StampedLock clients in mind here that do care, because they don't know the reader-writer mix ahead of time?  If writers are known to be frequent, this doesn't seem like the right tool?



Re: Does StampedLock need a releaseFence in theory?

Doug Lea
On 07/15/2016 02:06 PM, Hans Boehm wrote:
> My impression is that Linux kernel seqlock use cases don't care about write
> performance.  Presumably you have some StampedLock clients in mind here that do
> care, because they don't know the reader-writer mix ahead of time?  If writers
> are known to be frequent, this doesn't seem like the right tool?
>

StampedLock is multimodal. In the recommended usages (for example
those listed in javadocs), users switch from optimistic to ReadLock on
contention.

In any case, I mainly had in mind possible impact on other
versioned STM-ish locks out there with optimistic modes.

-Doug


>
> On Fri, Jul 15, 2016 at 10:49 AM, Doug Lea <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>
>     On 07/15/2016 10:27 AM, Will Deacon wrote:
>
>         a fence would be required to maintain order with a subsequent store.
>
>
>     Thanks.
>
>     I now agree with Martin that we should add VarHandle.storeStoreFence
>     to write-unlock.
>
>     Backing up: We've long known that versioned locks (seqlocks, StampedLock)
>     should not in general allow "roach-motel" reorderings. We couldn't
>     introduce StampedLock until we had fence intrinsics for the load-load
>     case. But no one considered that the implementation of write-unlock
>     via CAS might permit hardware write reorderings. Thanks to the ARM
>     folks for finding the tiny bit of wiggle room in CAS specs that
>     might in theory (if not in practice) require the complementary
>     fence on unlock. Too bad this will slow down some implementations
>     for no good reason, but probably not noticeably. I wonder if/how
>     this impacts other versioned lock implementations in other
>     languages.
>
>     -Doug
>
>
>
>     _______________________________________________
>     Concurrency-interest mailing list
>     [hidden email] <mailto:[hidden email]>
>     http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>
>

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: Does StampedLock need a releaseFence in theory?

Martin Buchholz-3
In reply to this post by David Holmes-6


On Fri, Jul 15, 2016 at 12:59 AM, David Holmes <[hidden email]> wrote:

 Also note that the C++11 Cmpxchg-SeqCst mapping for Aarch64 does not add any additional explicit barriers:

https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html


I've also been struggling to understand this, having thought of strong CAS as a single atomic bidirectional fenced operation.

 Cmpxchg-SeqCst is implemented using an ldaxr followed by a stlxr

I believe it is legal for relaxed memory ops before the ldaxr to be reordered with the ldaxr 
and likewise
I believe it is legal for relaxed memory ops after the stlxr to be reordered with the stlxr
and then to be reordered with each other (roach motel style)
without violating sequential consistency of  ldaxr and stlxr and without interfering with the use of these operations for implementing traditional locks.  But seqlocks are not traditional locks - they're a little "backwards".

I even have a mental model that justifies such behavior.  Suppose there is a slow read in progress and the cpu happens to already have exclusive access to the cache line containing the cas word.  It knows that the cas will succeed because it owns the cache line.  But because of release semantics, the release write cannot complete until the slow read completes.  Cpus hate stalls, so the cpu starts executing subsequent relaxed stores.  Unlike the stlxr, which has to wait for the slow read, there is nothing in the spec to prevent the subsequent stores from being written to memory immediately.  If the fast write and the slow read are to the same memory location, the read before the cas can see the write after the cas!

"""The Store-Release places no additional ordering constraints on any loads or stores appearing after the
Store-Release instruction."""



Re: Does StampedLock need a releaseFence in theory?

Hans Boehm
I think a good analogy is to compare the Aarch64 CAS implementation with CAS implemented on top of a roach-motel lock associated with the CAS location. The ordering properties are very similar for both.

This is a bit unfamiliar because most traditional lock implementations have included fences, and hence have not allowed full roach-motel reordering at the hardware level. But Itanium had fence-less lock implementations before Aarch64 did, with even weaker acquire/release operations.

(On Itanium, I believe

x = a;
unlock l1;
lock l2;
y = a;

does not order the two stores, since release and acquire operations can be reordered.  On Aarch64, it happens to do so.  None of which is detectable in data-race-free programs.)

On Tue, Jul 19, 2016 at 7:42 PM, Martin Buchholz <[hidden email]> wrote:


On Fri, Jul 15, 2016 at 12:59 AM, David Holmes <[hidden email]> wrote:

 Also note that the C++11 Cmpxchg-SeqCst mapping for Aarch64 does not add any additional explicit barriers:

https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html


I've also been struggling to understand this, having thought of strong CAS as a single atomic bidirectional fenced operation.

 Cmpxchg SeqCst is implemented using an ldaxr followed by a stlxr

I believe it is legal for relaxed memory ops before the ldaxr to be reordered with the ldaxr 
and likewise
I believe it is legal for relaxed memory ops after the stlxr to be reordered with the stlxr
and then to be reordered with each other (roach motel style)
without violating sequential consistency of  ldaxr and stlxr and without interfering with the use of these operations for implementing traditional locks.  But seqlocks are not traditional locks - they're a little "backwards".

I even have a mental model that justifies such behavior.  Suppose there is a slow read in progress and the CPU happens to already have exclusive access to the cache line containing the CAS word.  It knows that the CAS will succeed because it owns the cache line.  But because of release semantics, the release write cannot complete until the slow read completes.  CPUs hate stalls, so the CPU starts executing subsequent relaxed stores.  Unlike the stlxr, which has to wait for the slow read, there is nothing in the spec to prevent the subsequent stores from being written to memory immediately.  If the fast write and the slow read are to the same memory location, the read before the CAS can see the write after the CAS!

"""The Store-Release places no additional ordering constraints on any loads or stores appearing after the
Store-Release instruction."""



Re: Does StampedLock need a releaseFence in theory?

Andrew Haley
In reply to this post by Doug Lea
On 15/07/16 18:49, Doug Lea wrote:

> Thanks to the ARM folks for finding the tiny bit of wiggle room in
> CAS specs that might in theory (if not in practice) require the
> complementary fence on unlock. Too bad this will slow down some
> implementations for no good reason, but probably not noticeably.

Probably not.  From what I've seen, a StoreStore fence on ARMv8
doesn't cost anything significant when the store buffer is already
empty.  Of course this might change in future designs.
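A sketch of where such a complementary StoreStore fence could sit in an unlock path, using the `VarHandle.storeStoreFence()` available since JDK 9 (names are invented; this is not StampedLock's actual code, and at the Java level the volatile store already carries release semantics — the explicit fence illustrates the hardware-level barrier under discussion):

```java
import java.lang.invoke.VarHandle;

class WriteUnlockSketch {
    volatile long state = 1L;   // lock word: odd = write-locked (illustrative)
    long data;                  // plain field guarded by the lock

    void criticalSection() {
        data = 42L;             // plain write inside the write-critical section
    }

    void unlockWrite() {
        // Order the plain stores of the critical section before the store
        // that releases the lock word; per the message above, cheap on
        // ARMv8 when the store buffer has already drained.
        VarHandle.storeStoreFence();
        state += 1L;            // release the lock (state becomes even)
    }
}
```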

Andrew.

Re: Does StampedLock need a releaseFence in theory?

Andrew Haley
In reply to this post by Hans Boehm
On 14/07/16 01:53, Hans Boehm wrote:
> An ARMv8 compareAndSet operation (using only acquire and release
> operations, not dmb, as it should be implemented) will behave like the
> lock-based one in this respect.

Following up after a long delay:

What exactly do you mean by "as it should be implemented" here?  Is it
that the implementation of compareAndSet in Java should not use
ldaxr; cmp; stlxr?  Is the mapping in
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html incorrect?

Or are you only referring to StampedLock?

Andrew.

Re: Does StampedLock need a releaseFence in theory?

Hans Boehm
Sorry. My message was syntactically ambiguous. I believe compareAndSet should be implemented using acquire/release operations on ARMv8. It should not be implemented using dmb. The mappings on the Cambridge page for Aarch64 look correct to me.

On Wed, Nov 9, 2016 at 2:03 AM, Andrew Haley <[hidden email]> wrote:
On 14/07/16 01:53, Hans Boehm wrote:
> An ARMv8 compareAndSet operation (using only acquire and release
> operations, not dmb, as it should be implemented) will behave like the
> lock-based one in this respect.

Following up after a long delay:

What exactly do you mean by "as it should be implemented" here?  Is it
that the implementation of compareAndSet in Java should not use
ldaxr; cmp; stlxr?  Is the mapping in
https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html incorrect?

Or are you only referring to StampedLock?

Andrew.

