jdk9 VarHandle and Fence methods

Re: jdk9 VarHandle and Fence methods

Hans Boehm
On Tue, Oct 13, 2015 at 9:35 PM, David Holmes <[hidden email]> wrote:
>
> Late to the party here but I’ve been on a nice long vacation … :-)
>
> The argument below is to me conflating a temporal notion of “happens before” and the memory-model notion.
> Just because I can reason that thread A must have already performed a certain action (i.e. x=1) because thread B can’t get the lock,
> it does not mean that action is visible to thread B, unless a suitable happens-before edge exists. This should not be at all surprising
> because the store and lock in Thread A can already be freely reordered, if no HB edge exists.
>
Formally, you're correct.  On the other hand, I think both Java and C++ memory models have always tried to make it possible
to reason about correctness purely in terms of a simple "sequential consistency for data-race-free programs" model, where
everything could be understood in terms of thread interleavings.  This assumes you avoid a few constructs that explicitly relax
sequential consistency, like lazySet().  The happens-before reasoning was intended to be optional for the experts.
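
For example, lazySet() is one such construct; a minimal sketch (the class and method names are illustrative):

import java.util.concurrent.atomic.AtomicInteger;

class LazySetDemo {
    static final AtomicInteger flag = new AtomicInteger();

    static void publish() {
        // lazySet is a release-style store: it cannot be reordered with
        // earlier writes, but unlike a volatile store it omits the trailing
        // StoreLoad fence, so the store may become visible to other threads
        // later than a simple interleaving model would suggest.
        flag.lazySet(1);
    }
}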

In my view, the ability to conflate these two notions is an important feature (made reasonably precise in Sarita Adve's and
my PLDI 08 paper) that I don't want to lose.

Hans

> David

From: [hidden email] [mailto:[hidden email]] On Behalf Of Hans Boehm
Sent: Thursday, September 17, 2015 3:28 AM
To: Oleksandr Otenko
Cc: Doug Lea; [hidden email]
Subject: Re: [concurrency-interest] jdk9 VarHandle and Fence methods

 

This all depends on your perspective.  In my view, if we assumed that trylock() always provides a correct response based on the current state of the lock (something you shouldn't be assuming), then I think we would definitely want the x = 1 store to be visible after a failed trylock().  If you write (and yes, you shouldn't):

 

Thread 1:
  x = 1;
  y.lock();

Thread 2:
  while (y.trylock()) { y.unlock(); }
  load x;
 

That program does not intuitively have a data race.  The load and store to x cannot occur simultaneously, so it should have interleaving-based semantics, which means the load of x must return 1.
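
Rendered as a runnable Java sketch (with a ReentrantLock standing in for y; class and field names are illustrative):

import java.util.concurrent.locks.ReentrantLock;

class TryLockVisibility {
    static int x;                                  // plain, non-volatile
    static final ReentrantLock y = new ReentrantLock();

    static void thread1() {
        x = 1;
        y.lock();       // holds the lock from here on (never unlocks)
    }

    static void thread2() {
        // Spin until tryLock *fails*, i.e. until thread 1 holds the lock.
        while (y.tryLock()) {
            y.unlock();
        }
        int r = x;      // under the assumed semantics, must read 1
    }
}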

 

If there is no sw edge from the lock() to the failed trylock(), then it does have a data race.  But that's a contrivance of the model; data races no longer correspond to actual possible concurrent execution in a simple model.  In my view that's bad, since I can no longer rely on intuitions based on simultaneous executions for "data races".  I actually have to teach people about all the happens-before rules.  With the perspective of our PLDI 2008 paper, the two possible definitions of data race no longer agree.

 

People do very occasionally write code like this, vaguely along the lines of the close example in this thread.  ("Somebody's already working on it, so I don't have to.") Weaker ordering properties that guarantee correctness are really hard to explain and at best brittle.  (But that thread had the lock, of course it finished running that earlier initialization code.)  I think it's far easier to just say "trylock() may spuriously fail.  Your code doesn't work.  Use e.g. an atomic getAndSet() instead."
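
A sketch of that alternative for the "somebody's already working on it" pattern (the AtomicBoolean and method names are illustrative):

import java.util.concurrent.atomic.AtomicBoolean;

class ClaimOnce {
    static final AtomicBoolean claimed = new AtomicBoolean();

    static void maybeDoWork() {
        // getAndSet is a full read-modify-write: a false result really
        // does mean no other thread had claimed the work, with the
        // expected happens-before edges, unlike a failed trylock().
        if (!claimed.getAndSet(true)) {
            doTheWork();
        }
    }

    static void doTheWork() { /* ... */ }
}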

 

 

On Wed, Sep 16, 2015 at 8:28 AM, Oleksandr Otenko <[hidden email]> wrote:

Wow, that's some very flaky relaxation of the meaning of a lock!

I would expect the failing lock acquire to establish no sw edges, but I would certainly expect the order of acquires and releases (successful or not) to be total, and the total order of all acquires and releases (successful or not) to be in alignment with the total order of operations on volatiles. That way x=1 would indeed be visible, but only if it is a volatile store; there are no guarantees for normal stores.

Also, I would not expect the debugging thread to acquire the lock, if that breaks the protocol. You wouldn't encourage a debugging thread to write to arbitrary volatiles - so you wouldn't encourage the debugging thread to acquire arbitrary locks.

Alex

 

On 15/09/2015 06:16, Hans Boehm wrote:

How does it slow down lock()?

 

It depends on the precise guarantee you provide, and I suspect this thread didn't quite agree on that.  The most natural one is that the succeeding lock acquisition happens before the failed trylock().  That implies that if we have

 

x = 1;
lock();

 

those can't be reordered by the hardware, since a failing trylock() would have to see the assignment to x.  That requires a fence between them on ARM or Power.

 

I think the right way to think of trylock(), at least informally, is as allowing spurious failures. I.e. trylock() is allowed to behave as though the lock was held when it isn't.  You thus can't conclude anything about other threads from the fact that it failed.  In this view you don't have to think about memory ordering issues when reasoning about correctness, you just reason about spurious failures instead.

 

If your code is robust against unknown, e.g. debugger, threads acquiring the lock now and then, then it must be robust against this sort of spurious failure.  If the lock is really used only to provide mutual exclusion, this should not affect correctness.
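
In code, the shape that stays correct under spurious failures looks something like this (a sketch; the lock and method names are illustrative):

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

class OptionalWork {
    final Lock lock = new ReentrantLock();

    void maybeWork() {
        if (lock.tryLock()) {        // may "spuriously" fail
            try {
                doOptionalWork();    // we hold the lock; mutual exclusion holds
            } finally {
                lock.unlock();
            }
        }
        // On failure: just skip the optional work, drawing no conclusion
        // about what any other thread has or has not done.
    }

    void doOptionalWork() { /* ... */ }
}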

 

On Mon, Sep 14, 2015 at 6:41 PM, Vitaly Davidovich <[hidden email]> wrote:

How does it slow down lock()?

I don't necessarily disagree, but I can certainly see people considering tryLock to have the same ordering effect as a (failed) CAS.  It's certainly true that a CAS is a lower-level primitive than a lock, but I don't know if that resonates immediately when thinking about this.  It's also the case that on very popular platforms such as x86 a failing tryLock will have the same ordering as a successful one, and no difference is observed (and the JIT doesn't do anything different).

I don't understand the debugger thread example - what's the issue there?

sent from my phone

On Sep 14, 2015 9:07 PM, "Hans Boehm" <[hidden email]> wrote:

FWIW, this general issue is discussed in section 3 of http://dl.acm.org/citation.cfm?id=1375581.1375591 .

 

Yet another argument against providing the stronger guarantees is that, on many architectures, it doesn't just slow down trylock(), it more importantly slows down lock().  In general, if your code cares about ordering for unsuccessful trylock(), then it's not robust against, say, a debugging thread unexpectedly acquiring the lock for a short period.  In my view, in such a case, you're no longer using it as a lock, and you should be using something else, e.g. an atomic object, with stronger guarantees.

 

On Fri, Sep 4, 2015 at 4:18 AM, Doug Lea <[hidden email]> wrote:

On 09/03/2015 02:19 PM, Oleksandr Otenko wrote:

Has anyone come up with the answer about ordering for tryLock, or have I missed it?


You missed the dog not barking :-)

The Lock specs don't require any specific HB effects here on failed
tryLock. Even if we wanted to, we cannot retroactively impose any,
considering that anyone can implement the Lock interface (not just j.u.c.),
and some existing implementations might then be in violation.

As you and Vitaly pointed out, there are a few fringe cases where
users might want to impose ordering on failure. In jdk9, you'll
be able to do this with moded VarHandle accesses and/or fences. The
resulting extra fencing might be redundant here and there, but if you
cared enough, you could create and rely on custom locks with stronger
guarantees.
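
For example, a sketch of such a custom lock, built from a j.u.c. ReentrantLock plus the jdk9 VarHandle fence methods (the fence placement here is illustrative, and whether it yields a guarantee you can rely on depends on the underlying lock implementation):

import java.lang.invoke.VarHandle;
import java.util.concurrent.locks.ReentrantLock;

class FencedTryLock {
    private final ReentrantLock lock = new ReentrantLock();

    boolean tryLockWithOrdering() {
        boolean acquired = lock.tryLock();
        if (!acquired)
            VarHandle.acquireFence();  // pay for extra ordering even on failure
        return acquired;
    }

    void unlock() { lock.unlock(); }
}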

-Doug




Re: jdk9 VarHandle and Fence methods

Andrew Haley
In reply to this post by Doug Lea
On 21/08/15 19:58, Doug Lea wrote:

> getOpaque is the same, but with the further constraint that the read
> must actually occur even if other JMM rules would allow it to be
> optimized away (for example due to common subexpression evaluation).
> Usage is rare. Think IO.

I still don't understand.  Under what circumstances could it possibly
matter whether the read actually occurs?  Maybe it'd cause a memory-
mapped file to be read in or something.

We really need some sort of example of what this is for.  (Well, I do
anyway, and I'm sure it's not just me.)  We surely can't be adding
stuff to the library which appears to have no use.

Andrew.

Re: jdk9 VarHandle and Fence methods

Vitaly Davidovich
I could use a motivating example too.

I suppose it could be used by a thread to loop on a termination condition that's set by another thread.  Today volatile loads are used for that, but lots of times there's no happens-before requirement after the loop exits.  However, I don't know if this requires ordering constraints too (i.e., memory_order_relaxed ought to work too).
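
Something like this sketch (the VarHandle setup and names are illustrative):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class Worker {
    static final VarHandle DONE;
    static {
        try {
            DONE = MethodHandles.lookup()
                    .findVarHandle(Worker.class, "done", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    boolean done;   // plain field, set by another thread

    void run() {
        // getOpaque keeps the read inside the loop (no hoisting into a
        // register), but establishes no happens-before with the writer.
        while (!(boolean) DONE.getOpaque(this)) {
            doSomeWork();
        }
    }

    void doSomeWork() { /* ... */ }
}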

I like the atomicity guarantee and the operation having to occur (no elimination by compiler), but the full compiler fence I don't quite understand without an example.  I can see how having this load be dependent on control is useful but otherwise don't quite get further ordering requirements.



--
Sent from my phone


Re: jdk9 VarHandle and Fence methods

Andrew Haley
On 01/27/2016 12:08 PM, Vitaly Davidovich wrote:
> I suppose it could be used by a thread to loop on a termination condition
> that's set by another thread.  Today volatile loads are used for that, but
> lots of times there's no happens-before requirement after the loop exits.

If you do that on an AArch64, you're going to be waiting for
a very long time indeed.  :-)

Andrew.


Re: jdk9 VarHandle and Fence methods

Doug Lea
In reply to this post by Andrew Haley
(Sorry for slow replies; swamped..)

On 01/27/2016 04:10 AM, Andrew Haley wrote:

> On 21/08/15 19:58, Doug Lea wrote:
>
>> getOpaque is the same, but with the further constraint that the read
>> must actually occur even if other JMM rules would allow it to be
>> optimized away (for example due to common subexpression evaluation).
>> Usage is rare. Think IO.
>
> I still don't understand.  Under what circumstances could it possibly
> matter whether the read actually occurs?  Maybe it'd cause a memory-
> mapped file to be read in or something.

The canonical example of an opaque read is Linux RCU.

For opaque writes, the main (niche) case is publication loops
that do not need fences but must still prevent compilers from delaying
the writes until the end of the loop.

Both of these cases are arranged in C/C++ using volatile-casts.
(As a separate issue, C/C++ specs may someday either integrate
volatile-casts with their memory model, or create explicit
opaque mode.)

For discussion, see jmm-dev archives mainly from last August:
http://mail.openjdk.java.net/pipermail/jmm-dev/2015-August/thread.html

>
> We really need some sort of example of what this is for.  (Well, I do
> anyway, and I'm sure it's not just me.)  We surely can't be adding
> stuff to the library which appears to have no use.

The goal was parity of expressiveness with C/C++11.
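
Roughly, the intended correspondence: getOpaque/setOpaque is approximately
memory_order_relaxed (plus the guarantee that the access actually occurs),
getAcquire/setRelease approximately memory_order_acquire/memory_order_release,
and getVolatile/setVolatile approximately memory_order_seq_cst.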

-Doug



Re: jdk9 VarHandle and Fence methods

Vitaly Davidovich
In reply to this post by Andrew Haley
Why's that? How's it different from x86/64? On the surface, it seems like a decent fit: don't care about ordering with respect to other operations in the loop, just want to read from (cache coherent) memory each time.  If that's indeed the case, then clearly this API needs a lot more explanation (I think that's already been clear given this thread).





Re: jdk9 VarHandle and Fence methods

Vitaly Davidovich
In reply to this post by Doug Lea
> For opaque writes, the main (niche) case is publication loops
> that do not need fences but must still prevent compilers from delaying
> the writes until the end of the loop.

Preventing the movement/scheduling of subsequent unrelated loads, for example, seems like overkill here, no? For this particular example (a publication *loop*), a control dependence on the opaque store would be enough to prevent it from being delayed until the end of the loop, while still allowing movement within a loop iteration.

Also, given Andrew's comment above regarding AArch64 opaque load delays (which I don't quite understand, but hopefully he'll explain), how would an opaque write work there?
 
> The goal was parity of expressiveness with C/C++11.

Is there a reason for not having memory_order_relaxed then?  




Re: jdk9 VarHandle and Fence methods

Andrew Haley
In reply to this post by Vitaly Davidovich
On 01/27/2016 01:59 PM, Vitaly Davidovich wrote:
> why's that? how's it different from x86/64? On the surface, it seems like a
> decent fit -- don't care about ordering with respect to other operations in
> the loop, just want to read from (cache coherent) memory each time.

An obvious translation of

  while (atomic_load_explicit(&barf, memory_order_relaxed));

is

.L2:
        ldr     w0, [x1]
        cbnz    w0, .L2
        ret

which is unbounded.

With no memory fence before a read you're not going to see writes
by other cores until the core you're running on is pre-empted for
some reason.

Andrew.


Re: jdk9 VarHandle and Fence methods

Andrew Haley
In reply to this post by Doug Lea
On 01/27/2016 01:36 PM, Doug Lea wrote:
> For discussion, see jmm-dev archives mainly from last August:
> http://mail.openjdk.java.net/pipermail/jmm-dev/2015-August/thread.html

I'm looking at jdk9 APIs [Fences specifically],
Doug Lea dl at cs.oswego.edu Thu Aug 13 12:19:17 UTC 2015

which deals with using a storeStoreFence:

    void p() {      // called in producer thread
       for (int i = 0; i < 1000000; ++i)
         writes(heavyPureComputation(i));
    }

...

    void writes(int k) {
       x = k;
       y = k + 17;
       storeStoreFence(); // please actually store x and y if in a loop
    }

That makes sense.  I get it.  But with no fence at all?

Andrew.

Re: jdk9 VarHandle and Fence methods

Vitaly Davidovich
In reply to this post by Andrew Haley
That's ... bizarre :)

So m_o_relaxed is useless there? Or is it lowered with some lightweight fence to force a load of that address? What do C++ compilers emit?

Is it the same way for stores? If so, Doug's example of a publication loop that wants no delay from the compiler seems moot if the core delays it anyway without an explicit fence (unless opaque will be lowered with such fences).




--
Sent from my phone


Re: jdk9 VarHandle and Fence methods

Andrew Haley
On 01/27/2016 03:15 PM, Vitaly Davidovich wrote:
> That's ... bizarre :)
>
> So m_o_relaxed is useless there? Or is it lowered with some lightweight
> fence to force a load of that address? What do C++ compilers emit?

For this:

  while (atomic_load_explicit(&barf, memory_order_relaxed))
    atomic_store_explicit(&barse, 0, memory_order_relaxed);

a C++ compiler emits this:

        b       .L2
.L3:
        str     wzr, [x2]
.L2:
        ldr     w0, [x1]
        cbnz    w0, .L3

> Is it the same way for stores? If so, Doug's example of a publication loop
> that wants no delay from compiler seems moot if the core delays it without
> explicit fence (unless opaque will lower with these fences).

I would have thought so, but I can see no reason for opaque to emit
a fence.  Its spec does not seem to require it.

Andrew.



Re: jdk9 VarHandle and Fence methods

Doug Lea
On 01/27/2016 10:27 AM, Andrew Haley wrote:

> I would have thought so. but I can see no reason for opaque to emit
> a fence.  Its spec does not seem to require it.
>

It does not. Among the usage examples cited are cases where people
are content to observe (very) stale values, but are not content for
them to never be written in long-running loops. For example,
a progress monitor:

static int progress;    // occasionally read by monitor thread

for (int i = 0; i < 10000000; ++i) {
   progress = i;
   work();
}
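
With the jdk9 API, the same loop might be written with an opaque store; a sketch (the VarHandle setup shown is one way to obtain the handle):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class Progress {
    static int progress;   // occasionally read by a monitor thread
    static final VarHandle PROGRESS;
    static {
        try {
            PROGRESS = MethodHandles.lookup()
                .findStaticVarHandle(Progress.class, "progress", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static void run() {
        for (int i = 0; i < 10000000; ++i) {
            // setOpaque: the store must actually happen each iteration
            // (no coalescing to the end of the loop), but it carries no
            // ordering or happens-before obligations.
            PROGRESS.setOpaque(i);
            work();
        }
    }

    static void work() { /* ... */ }
}

The monitor thread can then poll with (int) PROGRESS.getOpaque().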

-Doug


Re: jdk9 VarHandle and Fence methods

Vitaly Davidovich
What if the writing core is isolated and only this thread is runnable on it? There would be no context switch.  When would that be written and visible? What other actions on AArch64 would make it visible?

I think for archs like AArch64 a minimal fence to ensure the operation becomes globally visible would be needed; otherwise, this API is a trap there, IMO.




--
Sent from my phone
