VarHandle.setVolatile vs classical volatile write


VarHandle.setVolatile vs classical volatile write

Dávid Karnok
Hi,

in an older blog post (https://shipilev.net/blog/2014/on-the-fence-with-dependencies/#_storeload_barrier_and_stack_usages) about write barriers, it is mentioned that on x86 the JIT flushes the write buffer with an XADD on a stack-local address when a volatile field is written; the option of using XCHG on the actual memory location instead is also mentioned.

My question is: does a compiled VarHandle.setVolatile do the same XADD trick, or does it use XCHG? Has there been a newer performance evaluation of XCHG since the blog post? In other words, is there a performance penalty or benefit in changing VarHandle.setVolatile() into VarHandle.getAndSet() on a modern x86?

My particular use case is running code designed for concurrency in a non-concurrent fashion, perhaps saving the cost of a MOV + XADD pair when an XCHG has the very same effect.
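
Concretely, the two shapes I am comparing look like this (a minimal sketch; class and field names are only for illustration):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class Sink {
    volatile long value;

    static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup()
                    .findVarHandle(Sink.class, "value", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void storeVolatile(long v) {
        // per the blog post: MOV to the field, then the XADD trick on a stack slot
        VALUE.setVolatile(this, v);
    }

    void storeExchange(long v) {
        // a single (implicitly LOCKed) XCHG on the field itself; the old value is discarded
        VALUE.getAndSet(this, v);
    }
}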

Thank you for your time.
--
Best regards,
David Karnok


Re: VarHandle.setVolatile vs classical volatile write

Paul Sandoz

On 18 Aug 2017, at 11:49, Dávid Karnok <[hidden email]> wrote:

> in an older blog post (https://shipilev.net/blog/2014/on-the-fence-with-dependencies/#_storeload_barrier_and_stack_usages) about write barriers, it is mentioned that on x86 the JIT flushes the write buffer with an XADD on a stack-local address when a volatile field is written; the option of using XCHG on the actual memory location instead is also mentioned.
>
> My question is: does a compiled VarHandle.setVolatile do the same XADD trick, or does it use XCHG?

It uses the same trick, since the VarHandles implementation in OpenJDK tunnels through to Unsafe with surrounding safety checks that the compiler folds away when it knows it’s safe to do so.
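
Schematically, for a long field the write path boils down to something like the following (a sketch using the public sun.misc.Unsafe for illustration only; the actual implementation uses jdk.internal.misc.Unsafe and generated accessor code):

import java.lang.reflect.Field;
import java.util.Objects;
import sun.misc.Unsafe;

class VolatileWriteSketch {
    static class Holder { volatile long value; }

    static final Unsafe UNSAFE;
    static final long OFFSET;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            OFFSET = UNSAFE.objectFieldOffset(Holder.class.getDeclaredField("value"));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // roughly what a VarHandle long-field setVolatile reduces to once the
    // receiver/type checks fold away
    static void setVolatile(Holder h, long v) {
        UNSAFE.putLongVolatile(Objects.requireNonNull(h), OFFSET, v);
    }
}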


> Has there been a newer performance evaluation of XCHG since the blog post?

Not that I am aware of.


> In other words, is there a performance penalty or benefit in changing VarHandle.setVolatile() into VarHandle.getAndSet() on a modern x86?


I suspect in general there may be a penalty, since getAndSet provides stronger ordering (a volatile read and write), so I would hold off on any global search-and-replace of setVolatile with getAndSet :-)

I would be interested in looking at performance results and generated assembly from some nano benchmarks.

Paul.


Re: VarHandle.setVolatile vs classical volatile write

Dávid Karnok
Thanks. I did a benchmark (https://gist.github.com/akarnokd/c0d606bd7e29d143ee82f2026898dbb5) and got the following results:

i5 6440HQ, Windows 10 x64, Java 9b181, JMH 1.19

Benchmark                       Mode  Cnt          Score         Error  Units
VolatilePerf.getAndAdd         thrpt    5  117841308,999 ± 3940711,142  ops/s
VolatilePerf.getAndSet         thrpt    5  118162019,136 ± 1349823,016  ops/s
VolatilePerf.releaseGetAndAdd  thrpt    5  118688354,409 ±  642044,969  ops/s
VolatilePerf.setRelease        thrpt    5  890890009,555 ± 4323041,380  ops/s
VolatilePerf.setVolatile       thrpt    5  118419990,949 ±  793885,407  ops/s
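
The benchmark methods are essentially of this shape (a sketch reconstructed from the method names above; the gist has the exact code):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class VolatilePerf {
    volatile long field;
    long counter;

    static final VarHandle FIELD;
    static {
        try {
            FIELD = MethodHandles.lookup()
                    .findVarHandle(VolatilePerf.class, "field", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    @Benchmark
    public void setVolatile() { FIELD.setVolatile(this, ++counter); }

    @Benchmark
    public void setRelease() { FIELD.setRelease(this, ++counter); }

    @Benchmark
    public long getAndSet() { return (long) FIELD.getAndSet(this, ++counter); }

    @Benchmark
    public long getAndAdd() { return (long) FIELD.getAndAdd(this, 1L); }

    @Benchmark
    public long releaseGetAndAdd() { return (long) FIELD.getAndAddRelease(this, 1L); }
}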


Being on Windows and on a laptop usually yields some variance, but it looks like there is practically no difference between the full-barrier operations.

Btw, thinking about XCHG and XADD: both have to provide the same strong volatile read and write, since each reads and writes memory atomically. I would have expected the ALU work in XADD to be detectably more costly, but a 3-cycle addition is small compared to a 22-45 cycle cache operation.



--
Best regards,
David Karnok


Re: VarHandle.setVolatile vs classical volatile write

Paul Sandoz

On 18 Aug 2017, at 13:58, Dávid Karnok <[hidden email]> wrote:

> Thanks. I did a benchmark (https://gist.github.com/akarnokd/c0d606bd7e29d143ee82f2026898dbb5) and got the following results: [...]
>
> Being on Windows and on a laptop usually yields some variance, but it looks like there is practically no difference between the full-barrier operations.


Ok, good to know, thanks for doing that!


> Btw, thinking about XCHG and XADD: both have to provide the same strong volatile read and write, since each reads and writes memory atomically. I would have expected the ALU work in XADD to be detectably more costly, but a 3-cycle addition is small compared to a 22-45 cycle cache operation.


Yes, good point.

Paul.



Re: VarHandle.setVolatile vs classical volatile write

Hans Boehm
Can someone explain why setRelease seems to be appreciably slower, when it is the only one that should not need any kind of fence on x86, and hence should be much faster? Am I misreading the results?


Re: VarHandle.setVolatile vs classical volatile write

Paul Sandoz

On 28 Aug 2017, at 17:27, Hans Boehm <[hidden email]> wrote:

> Can someone explain why setRelease seems to be appreciably slower, when it is the only one that should not need any kind of fence on x86, and hence should be much faster? Am I misreading the results?


Yes, you are misreading them. The units are the number of operations per second, so larger is better: setRelease, at ~890 Mops/s, is in fact the fastest by roughly 7.5x.

Paul.


Re: VarHandle.setVolatile vs classical volatile write

Andrew Haley
On 29/08/17 01:38, Paul Sandoz wrote:
> Yes, you are misreading them. The units are the number of operations per second.

This default is always very confusing.  I always use JMH and
set units to nanoseconds.  That would make a much more sensible
default.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: VarHandle.setVolatile vs classical volatile write

Aleksey Shipilev
On 08/29/2017 10:04 AM, Andrew Haley wrote:
> On 29/08/17 01:38, Paul Sandoz wrote:
>> Yes, you are misreading them. The units are the number of operations per second.
>
> This default is always very confusing.  I always use JMH and
> set units to nanoseconds.  That would make a much more sensible
> default.

Living in the nanobenchmarks world, I generally agree. Throughput per second was the widely agreed
default in the JMH 1.0 era, which focused on larger benchmarks. However, I frequently use large
numbers in JMH output as a litmus test for the novice user: if the submitter cannot (or did not
bother to) choose the right units for the experiment, maybe the submitter does not know how to
operate JMH, and thus the chance that the benchmarks need more attention is much higher. :) The
correlation is strong with this one...

As for the benchmark mode, there are always people who would expect "larger is better", and for
them "average time" would be confusing. Been there, tried that.

Thanks,
-Aleksey



Re: VarHandle.setVolatile vs classical volatile write

Dávid Karnok
My typical setting is developing libraries to execute and support reactive dataflows, where the usual question is: how many items/objects/messages can a particular flow transmit over time? Therefore, if a throughput measurement jumps from 100 Mops/s to 120 Mops/s after some optimization, that is more telling to me than seeing the time go from 10 ns to 8.3 ns per op.


--
Best regards,
David Karnok
