Java Memory Model and ParallelStream

Reposting a question I posted to jdk-dev concerning the non-volatility of array elements in Java, and assumptions that can be made about the synchronization barrier at the end of a parallel stream. Any insight into this would be appreciated.


---------- Forwarded message ---------
From: Luke Hutchison <[hidden email]>
Date: Thu, Mar 5, 2020 at 6:03 PM
Subject: Java memory model question
To: jdk-dev <[hidden email]>


Under the Java memory model, is it fair to assume that memory reads and writes can only be reordered within a method, but not across method boundaries? (Define "method" here as what's left after any inlining has taken place.)

Specifically I'm wondering: if a thread X launches a parallel stream that writes at most once to each independent element of an array, can it be assumed that when the stream processing ends, X will always read the value of all written array elements? In other words, can the termination of the stream be seen as a memory ordering barrier (in a weak sense)? 

I'm not asking whether the following code is advisable, only whether there's any chance of the main thread reading an old value from the array.

    int N = 50;
    String[] strValues = new String[N];
    IntStream.range(0, N)
            .parallel()
            .forEach(i -> strValues[i] = Integer.toString(i));
    // (*) Are the new strValues[i] all guaranteed to be visible here?
    for (String strValue : strValues) {
        System.out.println(strValue);
    }



---------- Forwarded message ---------
From: David Holmes <[hidden email]>
Date: Thu, Mar 5, 2020 at 7:09 PM
Subject: Re: Java memory model question
To: Luke Hutchison <[hidden email]>, jdk-dev <[hidden email]>


Hi Luke,

Probably a question better asked on [hidden email]

On 6/03/2020 11:03 am, Luke Hutchison wrote:
> Under the Java memory model, is it fair to assume that memory reads and
> writes can only be reordered within a method, but not across method
> boundaries? (Define "method" here as what's left after any inlining has
> taken place.)

No. Theoretically you could inline the entire program into a single
"method". Method entry/exit don't in themselves define synchronization
points.
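
For instance, one classic way this surfaces (a minimal sketch; the Flag class here is illustrative, not from the thread):

    class Flag {
        private boolean done;              // deliberately not volatile
        boolean isDone() { return done; }  // plain read; no ordering guarantee
        void finish() { done = true; }
    }

    // After the JIT inlines isDone(), it may hoist the plain read out of
    // the loop, so this can legally spin forever even after another
    // thread calls finish():
    Flag flag = new Flag();
    new Thread(flag::finish).start();
    while (!flag.isDone()) { /* may never exit */ }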

> Specifically I'm wondering: if a thread X launches a parallel stream that
> writes at most once to each independent element of an array, can it be
> assumed that when the stream processing ends, X will always read the value
> of all written array elements? In other words, can the termination of the
> stream be seen as a memory ordering barrier (in a weak sense)?

I would have expected this to be explicitly stated somewhere in the
streams documentation, but I don't see it. My expectation is that
terminal operations would act as synchronization points.

> I'm not asking whether the following code is advisable, only whether
> there's any chance of the main thread reading an old value from the array.
>
>      int N = 50;
>      String[] strValues = new String[N];
>      IntStream.range(0, N)
>              .parallel()
>              .forEach(i -> strValues[i] = Integer.toString(i));
>      // (*) Are the new strValues[i] all guaranteed to be visible here?
>      for (String strValue : strValues) {
>          System.out.println(strValue);
>      }

I would expect that code to be fine. parallel() would not be usable
otherwise.

Cheers,
David


---------- Forwarded message ---------
From: Brian Goetz <[hidden email]>
Date: Fri, Mar 6, 2020 at 12:46 AM
Subject: Re: Java memory model question
To: Luke Hutchison <[hidden email]>
Cc: jdk-dev <[hidden email]>


No, but  this is a common myth.  Method boundaries are not part of the JMM, and the JIT regularly makes optimizations that have the effect of reordering operations across method boundaries.  


---------- Forwarded message ---------
From: Luke Hutchison <[hidden email]>
Date: Fri, Mar 6, 2020 at 3:27 AM
Subject: Re: Java memory model question
To: Brian Goetz <[hidden email]>
Cc: jdk-dev <[hidden email]>


On Fri, Mar 6, 2020 at 12:46 AM Brian Goetz <[hidden email]> wrote:
No, but  this is a common myth.  Method boundaries are not part of the JMM, and the JIT regularly makes optimizations that have the effect of reordering operations across method boundaries. 

Thanks. That's pretty interesting, but I can't think of an optimization that would have that effect. Can you give an example?

On Thu, Mar 5, 2020 at 7:09 PM David Holmes <[hidden email]> wrote:
Probably a question better asked on [hidden email]

Thanks, I didn't know about that list.

> can the termination of the
> stream be seen as a memory ordering barrier (in a weak sense)?

I would have expected this to be explicitly stated somewhere in the
streams documentation, but I don't see it. My expectation is that
terminal operations would act as synchronization points.

Right, although I wasn't asking about "high-level concurrency" (i.e. coordination between threads), but rather "low-level concurrency" (memory operation ordering). The question arises from the Java limitation that fields can be marked volatile, but if the field is of array type, then the individual elements of the array cannot be marked volatile. There's no "element-wise volatile" array unless you resort to using an AtomicReferenceArray, which creates a wrapper object per array element, which is wasteful in both computation and space.
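
For reference, element-wise volatile access through an AtomicReferenceArray looks roughly like this (a minimal sketch):

    import java.util.concurrent.atomic.AtomicReferenceArray;

    AtomicReferenceArray<String> values = new AtomicReferenceArray<>(50);
    values.set(3, "three");        // write with volatile semantics
    String v = values.get(3);      // read with volatile semantics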

I understand that the lack of "element-wise volatile" arrays means that threads can end up reading stale values if two or more threads are reading from and writing to the same array elements. However, for this example, I specifically exclude that issue by ensuring that there's only ever either zero readers / one writer, or any number of readers / zero writers (every array element is only written once by any thread, then after the end of the stream, there are zero writers).

I'm really just asking if there is some "macro-scale memory operation reordering" that could somehow occur across the synchronization barrier at the end of the stream. I don't know how deep the rabbit hole of memory operation reordering goes.

I have to assume this is not the case, because the worker threads should all go quiescent at the end of the stream, so should have flushed their values out to at least L1 cache, and the CPU should ensure cache coherency between all cores beyond that point. But I want to make sure that can be guaranteed.

In practice I have never seen this pattern fail, and it's exceptionally useful to be able to write to disjoint array elements from an IntStream.range(0, N) parallel stream -- particularly as a pattern for very quickly parallelizing originally-serial code with maximum efficiency, by simply replacing for loops that have no dependencies between iterations with parallel streams. But I have been nervous to use this pattern since I realized that arrays cannot have volatile elements. Logically my brain tells me the fear is unfounded, but I wanted to double check.
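
For example, the rewrite described above, as a sketch (compute stands in for any hypothetical loop body with no cross-iteration dependencies):

    // Serial:
    for (int i = 0; i < N; i++) {
        results[i] = compute(i);
    }

    // Parallel, each iteration writing to a disjoint array element:
    IntStream.range(0, N)
            .parallel()
            .forEach(i -> results[i] = compute(i));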



Re: Java Memory Model and ParallelStream

Hi,

On 3/6/20 11:40 AM, Luke Hutchison via Concurrency-interest wrote:
> Thanks. That's pretty interesting, but I can't think of an optimization that would have that effect.
> Can you give an example?

Method gets inlined, and boom: optimizer does not even see the method boundary.

> There's no "element-wise volatile" array unless you resort to using an AtomicReferenceArray,
> which creates a wrapper object per array element, which is wasteful on computation and space.

Not really related to this question, but: VarHandles provide "use-site" volatility without
"def-site" volatility. In other words, you can access any non-volatile element as if it is volatile.

> I have to assume this is not the case, because the worker threads should all go quiescent at the end
> of the stream, so should have flushed their values out to at least L1 cache, and the CPU should
> ensure cache coherency between all cores beyond that point. But I want to make sure that can be
> guaranteed.

Stop thinking in low level? That would only confuse you.

Before trying to wrap your head around Streams, consider the plain thread pool:

    ExecutorService e = Executors.newFixedThreadPool(1);
    int[] a = new int[1];
    Future<?> f = e.submit(() -> a[0]++);
    f.get();
    System.out.println(a[0]); // guaranteed to print "1".

This happens because all actions in the worker thread (so all writes in lambda body) happen-before
all actions after result acquisition (so all reads after Future.get). Parallel streams carry a
similar property.

--
Thanks,
-Aleksey


Re: Java Memory Model and ParallelStream

On Fri, Mar 6, 2020 at 3:51 AM Aleksey Shipilev <[hidden email]> wrote:
On 3/6/20 11:40 AM, Luke Hutchison via Concurrency-interest wrote:
> Thanks. That's pretty interesting, but I can't think of an optimization that would have that effect.
> Can you give an example?

Method gets inlined, and boom: optimizer does not even see the method boundary.

...which is why I specifically excluded inlining in my original question (or said to consider the state after all inlining has taken place). I realize that inlining doesn't just happen at compile time, and the JIT could decide at any point to inline a function, but I want to ignore that (very real) possibility to understand whether reordering can take place across method boundaries _if inlining never happens_. Brian Goetz commented that "JIT regularly makes optimizations that have the effect of reordering operations across method boundaries" -- so I think the answer is yes. I just don't understand how that would happen.

> There's no "element-wise volatile" array unless you resort to using an AtomicReferenceArray,
> which creates a wrapper object per array element, which is wasteful on computation and space.

Not really related to this question, but: VarHandles provide "use-site" volatility without
"def-site" volatility. In other words, you can access any non-volatile element as if it is volatile.

Thanks for the pointer, although if you need to create one VarHandle per array element to guarantee this behavior, then that's logically no different than wrapping each array element in a wrapper object with AtomicReferenceArray.

(Maybe Java could provide something like a "volatile volatile" type that could be used with array-typed fields to make "volatile" apply to elements of an array-typed field, not just to the field itself?)

> I have to assume this is not the case, because the worker threads should all go quiescent at the end
> of the stream, so should have flushed their values out to at least L1 cache, and the CPU should
> ensure cache coherency between all cores beyond that point. But I want to make sure that can be
> guaranteed.

Stop thinking in low level? That would only confuse you.

Before trying to wrap your head around Streams, consider the plain thread pool:

    ExecutorService e = Executors.newFixedThreadPool(1);
    int[] a = new int[1];
    Future<?> f = e.submit(() -> a[0]++);
    f.get();
    System.out.println(a[0]); // guaranteed to print "1".

This happens because all actions in the worker thread (so all writes in lambda body) happen-before
all actions after result acquisition (so all reads after Future.get). Parallel streams carry a
similar property.

Good example, and I guess the "guaranteed" here answers my question.

I guess fundamentally I was asking if any memory reordering (or cache staleness) can happen across synchronization barriers. It sounds like that is not the case, due to synchronization barriers implementing a computational "happens-before" guarantee, which enforces the same "happens-before" total ordering on memory operations across the barrier.



Re: Java Memory Model and ParallelStream

On 3/6/20 12:11 PM, Luke Hutchison wrote:
> ...which is why I specifically excluded inlining in my original question (or said to consider the state
> after all inlining has taken place). I realize that inlining doesn't just happen at compile time, and
> the JIT could decide at any point to inline a function, but I want to ignore that (very real)
> possibility to understand whether reordering can take place across method boundaries _if inlining
> never happens_.

That is an odd exclusion. Inlining is the mother of all optimizations: it expands the optimization
scope. But even if formal inlining does not happen, you can devise closed-world/speculative
optimizations that peek into method implementations and use that knowledge to optimize. Coming up
with a concrete example is counter-productive, IMO, because it plays into caring about
implementation specifics rather than the high-level guarantees.


>     > There's no "element-wise volatile" array unless you resort to using an AtomicReferenceArray,
>     > which creates a wrapper object per array element, which is wasteful on computation and space.
>
>     Not really related to this question, but: VarHandles provide "use-site" volatility without
>     "def-site" volatility. In other words, you can access any non-volatile element as if it is volatile.
>
> Thanks for the pointer, although if you need to create one VarHandle per array element to guarantee
> this behavior,

No, you don't need a VH per array element, you can have one that accepts the array and the index:
https://docs.oracle.com/javase/9/docs/api/java/lang/invoke/MethodHandles.html#arrayElementVarHandle-java.lang.Class-

> then that's logically no different than wrapping each array element in a wrapper
> object with AtomicReferenceArray.

No, it is not the same. AtomicReferenceArray gives you one additional indirection to its own array.
VarHandle can do the operation _on the array you give it_.
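
For example (a minimal sketch using that method; in real code the VarHandle would normally be a static final field):

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    VarHandle AH = MethodHandles.arrayElementVarHandle(String[].class);
    String[] strValues = new String[50];
    AH.setVolatile(strValues, 3, "three");             // volatile write to element 3
    String v = (String) AH.getVolatile(strValues, 3);  // volatile read of element 3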


> I guess fundamentally I was asking if any memory reordering (or cache staleness) can happen across
> synchronization barriers. It sounds like that is not the case, due to synchronization barriers
> implementing a computational "happens-before" guarantee, which enforces the same "happens-before"
> total ordering on memory operations across the barrier.

Tell me what you mean by "synchronization barrier" (and how it relates to what you are
asking, to avoid an XY Problem), and then we can talk about what properties it has. Loosely
defined things can have whatever properties :)

Otherwise, look, high-level guarantees are king. They do not force you to know the low-level
details. In just about every parallel implementation, everything that worker threads do
happens-before their thread/task termination/publication, and thread/task termination/publication
happens-before the detection/acquisition of the result.

It is not really relevant how that detection/acquisition happens:
 - successful Thread.join() for a terminating worker; (guaranteed by JLS)
 - successful Future.get() from executor; (guaranteed by package spec in java.util.concurrent.*)
 - successful forEach for a parallel stream; (provided by extension?)
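
For instance, the Thread.join() case from this list, as a minimal sketch:

    int[] data = new int[1];
    Thread worker = new Thread(() -> data[0] = 42);  // plain write in worker
    worker.start();
    worker.join();   // JLS: all actions in worker happen-before join() returning
    System.out.println(data[0]);                     // guaranteed to print 42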

--
Thanks,
-Aleksey


Re: Java Memory Model and ParallelStream

OK, that answers all my questions. Thanks for taking the time to respond (and for the pointer to MethodHandles.arrayElementVarHandle).


Re: Java Memory Model and ParallelStream

Just to emphasize Aleksey's main point...

On 3/6/20 6:44 AM, Aleksey Shipilev via Concurrency-interest wrote:

> Otherwise, look, high-level guarantees are the king.
>
> It is not really relevant how that detection/acquisition happens:
>  - successful Thread.join() for a terminating worker; (guaranteed by JLS)
>  - successful Future.get() from executor; (guaranteed by package spec in java.util.concurrent.*)
>  - successful forEach for a parallel stream; (provided by extension?)

(See also the java.util.stream package docs; current version at:
https://docs.oracle.com/en/java/javase/13/docs/api/java.base/java/util/stream/package-summary.html)

In other words, java.util.concurrent components make strong enough
guarantees that almost nobody ever needs to think about them. On the
other hand, they so often invisibly do what people expect that it is too
easy for some to imagine other nonexistent rules are responsible.

-Doug

Re: Java Memory Model and ParallelStream

On Fri, Mar 6, 2020 at 5:26 AM Doug Lea via Concurrency-interest <[hidden email]> wrote:
In other words, java.util.concurrent components make strong enough
guarantees that almost nobody ever needs to think about them. On the
other hand, they so often invisibly do what people expect that it is too
easy for some to imagine other nonexistent rules are responsible.

Yes, I have significantly benefited from java.util.concurrent over many years, and I'm grateful for your work on it. I use these classes in almost every program I write, and have done so for many years. Usually I use concurrent collections along with futures for imposing a partial ordering where needed, but sometimes I have to get down to a lower level and implement custom locking schemes with semaphores and mutexes, etc. -- and semaphores and mutexes already get to the level of "this is hard enough to get right that there's a good reason for higher-level concurrency abstractions".

However, I also know there was a lot of very careful and principled work that went into the implementation of java.util.concurrent, and I have run into situations a few times where deferring to any sort of high-level abstraction just wasn't enough, sometimes because I was running into performance issues. The code example I gave does not use java.util.concurrent, because it doesn't need those classes. I just wanted to know if this pattern is safe. My logical mind said: Of course, logically how could this not be safe? -- but I have seen enough warnings about the non-volatility of Java array elements that I thought I should check with the experts.



Re: Java Memory Model and ParallelStream

On 3/6/20 11:11 AM, Luke Hutchison via Concurrency-interest wrote:
> I realize
> that inlining doesn't just happen at compile time, and the JIT could decide
> at any point to inline a function, but I want to ignore that (very real)
> possibility to understand whether reordering can take place across method
> boundaries _if inlining never happens_. Brian Goetz commented that "JIT
> regularly makes optimizations that have the effect of reordering operations
> across method boundaries" -- so I think the answer is yes. I just don't
> understand how that would happen.

The CPU *hardware* does this all the time. Few processors have a
total store order: x86, which does, is the exception here.
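
The classic demonstration is the store-buffering litmus test; below is a plain-Java sketch (jcstress is the proper tool for experiments like this; this naive harness may or may not actually witness the outcome on a given machine, but the JMM permits it for plain fields):

    public class StoreBufferingLitmus {
        static int x, y;

        public static void main(String[] args) throws InterruptedException {
            int seen = 0;
            for (int trial = 0; trial < 100_000; trial++) {
                x = 0; y = 0;
                int[] r = new int[2];
                Thread t1 = new Thread(() -> { x = 1; r[0] = y; });
                Thread t2 = new Thread(() -> { y = 1; r[1] = x; });
                t1.start(); t2.start();
                t1.join(); t2.join();
                if (r[0] == 0 && r[1] == 0) seen++;  // both loads beat both stores
            }
            // Store-load reordering is the one relaxation even x86's TSO
            // allows, so the (0,0) outcome can show up even on x86.
            System.out.println("(0,0) outcomes: " + seen);
        }
    }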

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: Java Memory Model and ParallelStream

On Fri, Mar 6, 2020, 7:15 AM Andrew Haley <[hidden email]> wrote:
The CPU *hardware* does this all the time. Few processors have a
total store order: x86, which does, is the exception here.

Well, that gets at my core question: whether a computational barrier always strictly enforces memory happens-before ordering across the barrier, i.e. whether a computational barrier is always also a memory ordering barrier.

If the CPU does not have a total store order, I could imagine cases where a computational barrier does not enforce memory ordering. What am I missing?




Re: Java Memory Model and ParallelStream

On 3/6/20 2:35 PM, Luke Hutchison wrote:
> On Fri, Mar 6, 2020, 7:15 AM Andrew Haley <[hidden email]> wrote:
>
>> The CPU *hardware* does this all the time. Few processors have a
>> total store order: x86, which does, is the exception here.
>
> Well that gets at my core question: whether a computational barrier always
> strictly enforces memory happens-before ordering across the barrier, i.e.
> whether a computational barrier

To answer that I'd need to be told what a "computational barrier" is. I
have some guesses, but I think you should spell it out.

> is always also a memory ordering barrier.
>
> If the CPU does not have a total store order, I could imagine cases where a
> computational barrier does not enforce memory ordering. What am I missing?
--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: Java Memory Model and ParallelStream

On 3/6/20 9:35 AM, Luke Hutchison via Concurrency-interest wrote:

> If the CPU does not have a total store order, I could imagine cases
> where a computational barrier does not enforce memory ordering. What am
> I missing?

You are missing us not having our act together and turning the guide at
http://gee.cs.oswego.edu/dl/html/j9mm.html along with most of a
formalization at http://compilers.cs.ucla.edu/papers/jam/ into a spec
update. Any year now...

-Doug


Re: Java Memory Model and ParallelStream

* Doug Lea via Concurrency-interest:

> On 3/6/20 9:35 AM, Luke Hutchison via Concurrency-interest wrote:
>
>> If the CPU does not have a total store order, I could imagine cases
>> where a computational barrier does not enforce memory ordering. What am
>> I missing?
>
> You are missing us not having our act together and turning the guide at
> http://gee.cs.oswego.edu/dl/html/j9mm.html along with most of a
> formalization at http://compilers.cs.ucla.edu/papers/jam/ into a spec
> update. Any year now...

And please specify it at the JVM level this time, not at the Java
language level. 8-)
Re: Java Memory Model and ParallelStream

On Fri, Mar 6, 2020, 7:45 AM Andrew Haley <[hidden email]> wrote:
To answer that I'd need to be told what a "computational barrier" is. I
have some guesses, but I think you should spell it out.

I was referring to a barrier in the standard sense of a synchronization mechanism that waits for some set of threads to finish some set of tasks before continuing. I added the words "computational" and "memory" before "barrier" to disambiguate the "happens-before" of computational work from the "happens-before" of writing values to memory, viz. the JMM. I then asked if these two "happens-before" relationships can be assumed to be exactly equivalent (i.e. taking into account the JMM, and how the JMM works with cache coherency behavior, out-of-order execution, etc.).
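
Under that definition, a CountDownLatch used as such a barrier carries both guarantees at once (a minimal sketch; the java.util.concurrent spec guarantees that actions before countDown() happen-before actions following a successful await()):

    import java.util.concurrent.CountDownLatch;

    int n = 4;
    int[] results = new int[n];
    CountDownLatch done = new CountDownLatch(n);
    for (int i = 0; i < n; i++) {
        final int id = i;
        new Thread(() -> {
            results[id] = id * id;   // plain write
            done.countDown();        // signals computational completion
        }).start();
    }
    done.await();   // countDown() happens-before await() returning,
                    // so all the plain writes above are visible here
    for (int r : results) System.out.println(r);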



