getAndIncrement with backoff

Hi everyone,

It seems to be common knowledge (see "Lightweight Contention Management for Efficient Compare-and-Swap Operations", https://arxiv.org/pdf/1305.5800.pdf) that simple exponential backoff drastically increases the throughput of CAS-based operations (e.g. getAndIncrement). I checked this on my own, and yes, this is still true for OpenJDK 11:
[single CPU] 3.4 GHz Intel Core i5 (4 cores)
[OS] macOS 10.13.6 (17G4015)
[JDK] OpenJDK 11.0.1+13

1 thread                                Mode  Cnt    Score   Error  Units
atomicLongGetAndIncrement              thrpt   45  139.645 ± 1.383  ops/us
atomicLongGetAndIncrementManual        thrpt   45  111.469 ± 0.665  ops/us
atomicLongGetAndIncrementManualBackoff thrpt   45  111.545 ± 0.663  ops/us
4 threads
atomicLongGetAndIncrement              thrpt   45   50.892 ± 0.131  ops/us
atomicLongGetAndIncrementManual        thrpt   45   12.435 ± 0.162  ops/us
atomicLongGetAndIncrementManualBackoff thrpt   45   92.659 ± 1.459  ops/us
32 threads
atomicLongGetAndIncrement              thrpt   45   51.844 ±  0.803 ops/us
atomicLongGetAndIncrementManual        thrpt   45   12.495 ±  0.331 ops/us
atomicLongGetAndIncrementManualBackoff thrpt   45  157.771 ± 19.528 ops/us

The fact that atomicLongGetAndIncrement is faster than atomicLongGetAndIncrementManual for 1 thread tells me that, as a result of @HotSpotIntrinsicCandidate, this method at runtime uses an implementation different from what could have been compiled from the Java sources; and this different implementation has nothing to do with backoff, because with 1 thread there are no CAS failures.
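
For reference, the "manual" variants presumably look something like the following sketch (my reconstruction, not necessarily the benchmarked code): a plain CAS retry loop, plus the same loop with a capped exponential spin backoff on CAS failure.

    import java.util.concurrent.atomic.AtomicLong;

    final class ManualCounters {
        private static final int MAX_SPINS = 1 << 10; // illustrative cap
        private final AtomicLong value = new AtomicLong();

        // "Manual" getAndIncrement: a plain lock-free CAS retry loop.
        long getAndIncrementManual() {
            long v;
            do {
                v = value.get();
            } while (!value.compareAndSet(v, v + 1));
            return v;
        }

        // Same loop, but every CAS failure doubles a bounded spin-wait before retrying.
        long getAndIncrementManualBackoff() {
            int spins = 1;
            while (true) {
                long v = value.get();
                if (value.compareAndSet(v, v + 1)) {
                    return v;
                }
                for (int i = 0; i < spins; i++) {
                    Thread.onSpinWait(); // spin hint; may lower to x86 'pause'
                }
                if (spins < MAX_SPINS) {
                    spins <<= 1; // exponential backoff, capped
                }
            }
        }
    }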

What are the reasons behind the choice to not use backoff in OpenJDK's AtomicLong.getAndIncrement?

Regards,
Valentin

Re: getAndIncrement with backoff

The intrinsic probably compiles to lock:xadd on x86, which has the getAndAdd semantics.

The reason for not using backoff is that there is no backoff in the intrinsic. The reason for the different results is possibly cache-line sharing: backoff allows one CPU to usurp the cache line and make progress; without backoff, the line is always shared between all the CPUs.


Alex

Re: getAndIncrement with backoff

> The reason for not using backoff is that there is no backoff in the intrinsic
So lock:xadd does not have backoff logic; might it then be a good idea not to use this intrinsic and instead use a "manual" getAndIncrement implementation with backoff in the JDK, since it results in significantly higher throughput? In the mentioned article the authors tested not only multiple architectures with many CPU threads (not 4 like I do), but also CAS-based data structures, and they achieved throughput improvements with backoff in all tested cases.

> The reason for the different results is possibly cache-line sharing: backoff allows one CPU to usurp the cache line and make progress; without backoff, the line is always shared between all the CPUs.
Sure: since threads retry the CAS less often once they encounter failures, this effectively reduces contention.

Valentin

Re: getAndIncrement with backoff

Last time I heard this discussed, the conclusion was that there isn’t a universal backoff strategy that suits all platforms and cases.

For example, the logic of backoff used in this test case is good for continuous atomic incrementing. Will the CPUs really be spinning in a trivial increment loop? More likely you’ll slam the hot spot with multiple CPUs at once, then the threads go away to do things, then come back. Backing off the way it’s done in the test is not necessarily going to help - the win comes from one thread being able to do several increments while the others do not contend to do the same; but if the threads are not attempting to do several increments, backoff is going to work like the usual CAS loop.

lock:xadd defers this to the hardware. The instruction has no “redo” - but of course internally, at the hardware level, it probably is still a CAS-like loop with a retry - and a backoff (I think all consensus protocols have corner cases that can only be resolved through timing a race).

Alex

Re: getAndIncrement with backoff

@alex
> might it then be a good idea not to use this intrinsic and instead use a "manual" getAndIncrement implementation with backoff in the JDK, since it results in significantly higher throughput?
It depends on whether you're seeking max throughput or lower (global) latencies.

> In the mentioned article the authors tested not only multiple architectures with many CPU threads (not 4 like I do), but also CAS-based data structures, and they achieved throughput improvements with backoff in all tested cases.

It depends on how the data structures are designed, i.e. whether they take advantage of higher throughput or of lower latencies: I've recently created this XADD-based queue in https://github.com/JCTools/JCTools/pull/230, and with a number of producers <= number of cores (pinned) I'm not getting any improvement (if not worse latencies) if I manually back off instead of using XADD.

Thanks for sharing the article anyway, it's very interesting :)
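
To illustrate the XADD approach (a sketch of the general idea only, not JCTools' actual code; it ignores the wrap-around and backpressure handling a real queue needs): producers claim distinct slots with one wait-free getAndIncrement instead of retrying a CAS on a shared index.

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.atomic.AtomicReferenceArray;

    final class XaddSlotClaim<E> {
        private final AtomicLong producerIndex = new AtomicLong();
        private final AtomicReferenceArray<E> buffer;
        private final int mask;

        XaddSlotClaim(int capacityPowerOfTwo) {
            buffer = new AtomicReferenceArray<>(capacityPowerOfTwo);
            mask = capacityPowerOfTwo - 1;
        }

        void offer(E e) {
            // One wait-free fetch-and-add (lock xadd on x86) gives each producer a unique slot.
            long seq = producerIndex.getAndIncrement();
            buffer.lazySet((int) (seq & mask), e); // publish into the claimed slot
        }
    }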



Re: getAndIncrement with backoff

@alex 
Sorry, the previous answer was for @valentin! :P

Re: getAndIncrement with backoff

On 1/28/19 9:24 AM, Francesco Nigro via Concurrency-interest wrote:
> @alex

>> might it then be a good idea not to use this intrinsic and instead
>> use a "manual" getAndIncrement implementation with backoff in the
>> JDK, since it results in significantly higher throughput?

> It depends on whether you're seeking max throughput or lower (global) latencies.

I wonder. If your getAndIncrement() is implemented with an LL/SC loop,
the backoff path is only executed when an SC fails, so unless there
really is significant contention there won't be any increase in
latency.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: getAndIncrement with backoff

@andrew
Correct, but my concern about latencies is actually the backoff strategy itself: if not chosen correctly, it could introduce safepoint polls that "could" lead to unpredictable slowdowns in specific cases.
The same should be said for Thread::onSpinWait, which can be backed by a pause instruction and was recently at the center of a discussion about its effectiveness (badly implemented, AFAIK).

As @alex has said, I'm totally of the idea that a real data structure will provide something interesting to do when a backoff should occur, but if that's not the case the risk is just to spend time tuning timers to get park strategies to work as expected...


Re: getAndIncrement with backoff

On 1/28/19 11:20 AM, Francesco Nigro wrote:

> Correct, but my concern about latencies is actually the backoff
> strategy itself: if not chosen correctly, it could introduce safepoint
> polls that "could" lead to unpredictable slowdowns in specific cases.

True, but that would be a really extreme case.

> The same should be said for Thread::onSpinWait, which can be backed
> by a pause instruction and was recently at the center of a discussion
> about its effectiveness (badly implemented, AFAIK).

I think so too. I'm waiting to see some proper working measurements on
AArch64 to convince me that it's useful.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: getAndIncrement with backoff

On 1/28/19 11:33 AM, Andrew Haley via Concurrency-interest wrote:

> I think so too. I'm waiting to see some proper working measurements on
> AArch64 to convince me that it's useful.

e.g. http://mail.openjdk.java.net/pipermail/aarch64-port-dev/2017-August/004870.html

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: getAndIncrement with backoff

On 1/28/19 2:29 AM, Valentin Kovalenko via Concurrency-interest wrote:

> It seems to be common knowledge (see "Lightweight Contention Management
> for Efficient Compare-and-Swap Operations",
> https://arxiv.org/pdf/1305.5800.pdf) that simple exponential backoff
> drastically increases the throughput of CAS-based operations (e.g.
> getAndIncrement). I checked this on my own, and yes, this is still true
> for OpenJDK 11:

I am surprised that you did not compare LongAdder, which uses CAS
failures to guide contention-spreading that is usually much more
effective than backoff. On the other hand, it cannot be used when you
need the atomic value of the "get".
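
For readers who haven't used it: LongAdder stripes contended updates across internal cells and only folds them on read, which is exactly why it cannot hand each caller the atomic result of its own update the way getAndIncrement does. A minimal usage sketch:

    import java.util.concurrent.atomic.LongAdder;

    final class HitCounter {
        private final LongAdder hits = new LongAdder();

        void record() {
            hits.increment(); // contended updates are spread across cells, guided by CAS failures
        }

        long total() {
            return hits.sum(); // folds the cells; a snapshot, not an atomic read-modify-write result
        }
    }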

-Doug

Re: getAndIncrement with backoff

@valentin

> But Francesco provided an explanation of how this still can negatively affect the performance (though I, unfortunately, didn't understand it). Francesco, may I ask you to explain the same thing for a less educated audience? :)

Sorry: I spoke without context :P
The long story short is that there is an implementation detail of the JVM (though sort-of-defined in the OpenJDK glossary) called a "safepoint": these are safe states in which the JVM can perform specific operations/optimizations, relying on the fact that during these intervals the mutator threads (i.e. the Java threads) cannot break the specific invariants that allow such operations to be performed.
The mechanism that allows a safepoint to be reached (global or individual/per-Java-thread, given that recent changes in JDK 10 have introduced the notion of thread-local handshakes, https://bugs.openjdk.java.net/browse/JDK-8185640) is the insertion of safepoint polls among the compiled code instructions (in the bytecode too, not just the ASM); JMH's LinuxPerfAsmProfiler shows them as {poll} or {poll_return} in the annotated ASM.
Such polls, when reached AND a local/global safepoint is needed, lead to a SEGV (aka segmentation fault) that, when handled, allows the JVM to start the local/global safepoint.

If a backoff strategy contains a safepoint poll, the risk lies in the moment a global safepoint is requested: it will wait until all the mutator threads reach it AND the safepoint operation(s) finish, leading to latency outliers (or maybe just spikes, given that you could choose -XX:GuaranteedSafepointInterval too).

Doing the same thing in C should be equivalent, considering that most concurrent primitives are intrinsics (and won't contain such polls, AFAIK) and the resulting ASM should be the same; the reality is that if you write your own code you can't be sure that polls are not added, unless you enjoy reading both the JVM source code and the ASM from JMH :P

As @andrew has already written (and he can surely correct any imprecision about safepoints/safepoint polls I've written), this is not something a user should normally care about, but if you're seeking the best sustainable throughput (or just predictable latencies) it is, IMHO, something you should at least consider - and then you can ignore it right after :P

@doug
LongAdder rocks +100

Cheers,
Franz (Francesco is longer and most people prefer to call me that :))

Re: getAndIncrement with backoff

On 1/28/19 7:29 AM, Valentin Kovalenko via Concurrency-interest wrote:
> (the test and full measurements are available here: https://github.com/stIncMale/sandbox/blob/master/benchmarks/src/test/java/stincmale/sandbox/benchmarks/AtomicApiComparisonTest.java)

I can't build this stuff.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: getAndIncrement with backoff

Actually, I got it done, as long as I use exactly jdk11. No idea how to
run the test though. A simple JMH test would have helped a lot.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: getAndIncrement with backoff

The key difference is that, on architectures that support atomic increment or atomic add instructions, getAndIncrement() can (and should) be implemented as a wait-free construct. For various concurrent mechanisms that would use this construct, this makes the difference between being wait-free and lock-free. And that difference is generally much more important and valuable than the (potentially non-forward-progress-guaranteeing-for-each-thread) throughput benefit of the primitive itself under highly contended situations.

Note that e.g. x86’s lock xadd is wait-free in implementation. But so is the LOCK CAS instruction, and so are the LL and SC instructions on any processor that supports them. AFAIK all user-mode instructions are wait-free, by definition, and must complete within a small and reasonable bound as a fundamental requirement of proper hardware design (this quality is easy to deduce from basic expectations we have of all hardware with user-mode instructions: e.g. just think of the obvious system-hanging user-mode DoS attack if any non-wait-free user-mode instructions actually existed).

Any looping or redo logic is always a pure software implementation detail. Whether or not a given construct can be implemented to be wait free depends on the wait-free primitives it has at its disposal (which in turn depend on the capabilities of hardware instructions available).

For the basic atomic constructs (like the various atomic operations offered by the JDK for e.g. AtomicLong) the choice should always be to implement the construct in a wait-free manner where possible. Wherever we don’t, we hobble the ability of concurrent mechanisms written using those constructs to be wait-free.
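
To make the distinction concrete (a sketch, with the lock xadd lowering as discussed earlier in the thread): a CAS retry loop like the "manual" variant above is lock-free - some thread always wins, but a particular thread can lose its CAS indefinitely - whereas a single fetch-and-add has no retry at all, so every thread finishes in a bounded number of its own steps.

    import java.util.concurrent.atomic.AtomicLong;

    final class WaitFreeCounter {
        private final AtomicLong value = new AtomicLong();

        // No software retry loop: the whole read-modify-write is one
        // instruction-backed call, so per-thread progress is bounded.
        long getAndIncrement() {
            return value.getAndAdd(1L); // intrinsified; lock xadd on x86
        }
    }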


Re: getAndIncrement with backoff

I think you are right, as long as we stay out of the rabbit hole.

I don’t know exactly how the hardware works, but at the finest level, when two cores attempt to lock the bus, or the cache line, or whatever the mechanism implementing mutual exclusion is, one of the cores has to wait. So all we do is delegate the wait to that hardware-level wait. (But that’s only if we do go down the rabbit hole.)

Alex

On 28 Jan 2019, at 16:48, Gil Tene <[hidden email]> wrote:

> For the basic atomic constructs (like the various atomic operations offered by the JDK for e.g. AtomicLong) the choice should always be to implement the construct in a wait-free manner where possible. Wherever we don’t, we hobble the ability of concurrent mechanisms written using those constructs to be wait-free.
Re: getAndIncrement with backoff



On Jan 28, 2019, at 8:54 AM, Alex Otenko <[hidden email]> wrote:

> (But that’s only if we do go down the rabbit hole.)

That rabbit hole takes you to a world where any waits are completely bounded (and where the bounds are very small and reasonable). It is a wait-free rabbit hole.


Re: getAndIncrement with backoff

On 1/28/19 4:52 PM, Alex Otenko via Concurrency-interest wrote:

> I don’t know exactly how the hardware works, but at the finest level,
> when two cores attempt to lock the bus, or the cache line, or whatever
> the mechanism implementing mutual exclusion is, one of the cores has
> to wait.

Yes, I agree, but that's true whenever you write to memory. When memory
is updated, only one core has the cache line in exclusive state: the
others have read-only local copies. There's nothing special happening
with something like a CAS in that regard. The cache coherency protocol
handles it in the same way as usual.

However, there is an exception to this. With Arm's "far atomic"
instructions, the intention is not to move cache lines around at
all. AIUI, whichever core already has the cache line in exclusive
state keeps it locally. The core issuing the far atomic op sends a
message to whichever core owns the cache line asking for the op to be
done remotely. The acknowledgment message contains the result. This
avoids cache-line ping-ponging altogether, at least in theory. There
will be some delay at times of high contention because processors will
be waiting for replies, of course.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: getAndIncrement with backoff

JSR166 Concurrency mailing list
[Andrew Haley]
> I can't build this stuff.
> Actually, I got it done, as long as I use exactly JDK 11. No idea how to run the test though. A simple JMH test would have helped a lot.

This is, in fact, a plain JMH test :) It can be run by executing
mvn clean test -f benchmarks/pom.xml -Dstincmale.sandbox.benchmarks.dryRun=false -Dtest=AtomicApiComparisonTest

The only non-JMH part is that I start it via a JUnit 5 runner, but one may simply add a main method that calls the throughputThreads4/throughputThreads32 methods directly.
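
For reference, here is a minimal sketch of such a main method that drives JMH's Runner API directly instead of going through JUnit (the benchmark class name matches the linked source; everything else is illustrative, not the benchmark's actual code):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public final class BenchmarkMain {
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
            // select the benchmark methods by a name regexp
            .include("AtomicApiComparisonTest")
            .threads(4) // or 32
            .build();
        new Runner(options).run();
    }
}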

[Gil Tene]
> x86’s lock xadd is wait free in implementation

Oh, I didn't know this. Yes, the manual retry cycle with CAS is only lock-free. Interestingly, the atomics package documentation (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/atomic/package-summary.html) only guarantees lock freedom and does not even mention wait freedom, so we can't exactly rely on wait freedom anyway.
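
For concreteness, here is a minimal sketch of the kind of manual CAS loop with exponential backoff being compared above (this is not the benchmark's actual code; the class name, backoff cap and spin-wait strategy are all illustrative):

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

final class BackoffCounter {
    private static final int MAX_BACKOFF = 1 << 10; // illustrative cap
    private final AtomicLong value = new AtomicLong();

    long getAndIncrement() {
        int backoff = 1;
        while (true) {
            long current = value.get();
            if (value.compareAndSet(current, current + 1)) {
                return current;
            }
            // CAS failed: spin for a randomized, exponentially growing
            // number of iterations before retrying.
            int spins = ThreadLocalRandom.current().nextInt(backoff);
            for (int i = 0; i < spins; i++) {
                Thread.onSpinWait(); // spin-wait hint, JDK 9+
            }
            if (backoff < MAX_BACKOFF) {
                backoff <<= 1;
            }
        }
    }
}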

Valentin

On Mon, 28 Jan 2019 at 09:53, <[hidden email]> wrote:

Today's Topics:

   1. Re: getAndIncrement with backoff (Alex Otenko)
   2. Re: getAndIncrement with backoff (Andrew Haley)
   3. Re: getAndIncrement with backoff (Andrew Haley)
   4. Re: getAndIncrement with backoff (Nathan and Ila Reynolds)
   5. Re: getAndIncrement with backoff (Gil Tene)


----------------------------------------------------------------------

Message: 1
Date: Mon, 28 Jan 2019 12:52:54 +0000
From: Alex Otenko <[hidden email]>
To: Valentin Kovalenko <[hidden email]>
Cc: concurrency-interest <[hidden email]>
Subject: Re: [concurrency-interest] getAndIncrement with backoff

Well, you can take a look at the locks used for synchronized. The logic there aims at precisely the same cases: should the lock bias to one thread, should it be a spin-lock, should it inflate to a "fat lock" (an OS-provided primitive)? But each lock behaves differently, so the JVM makes different decisions for each individual lock.

The atomic increment intrinsics would suffer from the same problem - each hot spot would require different treatment (but then you also have to come up with, and justify the cost of, added tracking for the contention of individual atomic longs).
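
(As an aside not raised in the thread: for the pure counting case, the JDK already sidesteps per-call-site contention tracking with java.util.concurrent.atomic.LongAdder, which spreads contended updates across cells rather than backing off. A minimal illustration:)

import java.util.concurrent.atomic.LongAdder;

public final class LongAdderDemo {
    public static void main(String[] args) {
        // LongAdder stripes contended updates across internal cells and
        // sums them on read; the trade-off is that it offers no
        // getAndIncrement-style return value, so it is not a drop-in
        // replacement for AtomicLong.
        LongAdder counter = new LongAdder();
        counter.increment();
        System.out.println(counter.sum()); // prints 1
    }
}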

Alex

> On 28 Jan 2019, at 11:46, Valentin Kovalenko via Concurrency-interest <[hidden email]> wrote:
>
> > there isn’t a universal backoff strategy that suits all platforms and cases
>
> It does not seem to be a problem to have different intrinsic backoff
> strategies for different platforms, but this, of course, can't be done for
> different cases. However, since backoff only comes into play in case of a
> CAS failure (or even multiple failures), it should affect neither throughput
> nor latency in a low-contention scenario. But Francesco provided an
> explanation of how this can still negatively affect performance (though I,
> unfortunately, didn't understand it). Francesco, may I ask you to explain
> the same thing for a less educated audience? :)
>
> > lock:xadd defers this to the hardware. The instruction has no “redo” - but of course internally at hardware level it probably is still a CAS-like loop with a retry - and a backoff (I think all consensus protocols have corner cases that can only be resolved through timing a race).
>
> So basically this is the same idea we usually use in programming: rely on
> the underlying layer's implementation if it is provided. And the underlying
> layer is hardware in this case.
>
> Valentin



------------------------------

Message: 2
Date: Mon, 28 Jan 2019 14:32:55 +0000
From: Andrew Haley <[hidden email]>
To: Valentin Kovalenko <[hidden email]>,
        concurrency-interest <[hidden email]>
Subject: Re: [concurrency-interest] getAndIncrement with backoff

On 1/28/19 7:29 AM, Valentin Kovalenko via Concurrency-interest wrote:
> (the test and full measurements are available here https://github.com/stIncMale/sandbox/blob/master/benchmarks/src/test/java/stincmale/sandbox/benchmarks/AtomicApiComparisonTest.java)

I can't build this stuff.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


------------------------------

Message: 3
Date: Mon, 28 Jan 2019 14:50:03 +0000
From: Andrew Haley <[hidden email]>
To: Valentin Kovalenko <[hidden email]>,
        concurrency-interest <[hidden email]>
Subject: Re: [concurrency-interest] getAndIncrement with backoff

On 1/28/19 2:32 PM, Andrew Haley via Concurrency-interest wrote:
> On 1/28/19 7:29 AM, Valentin Kovalenko via Concurrency-interest wrote:
>> (the test and full measurements are available here https://github.com/stIncMale/sandbox/blob/master/benchmarks/src/test/java/stincmale/sandbox/benchmarks/AtomicApiComparisonTest.java)
>
> I can't build this stuff.

Actually, I got it done, as long as I use exactly JDK 11. No idea how to
run the test though. A simple JMH test would have helped a lot.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


------------------------------

Message: 4
Date: Mon, 28 Jan 2019 08:48:28 -0700
From: Nathan and Ila Reynolds <[hidden email]>
To: [hidden email]
Subject: Re: [concurrency-interest] getAndIncrement with backoff

Here is a clarification about lock:xadd on Intel x86 platforms. The
hardware does **not** execute the operation in a CAS-like loop with a
retry.  The hardware thread retrieves the cache line in the exclusive
state, loads the value, does the add and stores the value in the cache
line.  While the operation is progressing, the core will not allow the
cache line to be modified by another hardware thread.  Hence, there is
no chance for failure or a retry.

Here is a typical CAS loop I write in software...

do
{
    expect = m_value.get();
    update = compute(expect);
}
while (!m_value.compareAndSet(expect, update));

The hardware thread will retrieve the cache line in the shared state
during m_value.get().  The cache line is then vulnerable to being
modified by another hardware thread during compute(expect).  The hardware
thread then executes compareAndSet().  It changes the cache line to the
exclusive state, loads the value, compares the value to expect and, if
equal, stores update.  While the CAS instruction is executing, no other
hardware thread can modify the cache line, just like lock:xadd.  A
backoff can supposedly help reduce the number of times compute(expect)
is executed.

There is another concern.  m_value.get() retrieves the cache line in the
shared state, which costs a bus operation, and then compareAndSet() has to
change the cache line to the exclusive state, which costs another, more
expensive bus operation.  Think about this code.

do
{
    m_nearby = 0;
    expect   = m_value.get();
    update   = compute(expect);
}
while (!m_value.compareAndSet(expect, update));

The idea is to retrieve the cache line in the exclusive state to begin
with by writing to the cache line at the beginning of the loop (i.e.
m_nearby = 0).  This will reduce the number of bus operations.  I have
never tried this but it might perform better.
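
For concreteness, a self-contained sketch of this exclusive-prefetch idea (all names are illustrative; it assumes a VarHandle, JDK 9+, over a plain field so that the write to nearby and the CAS target sit in the same object and may therefore share a cache line - field layout is JVM-specific, so that is a hope, not a guarantee):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class ExclusivePrefetchCounter {
    private static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup()
                .findVarHandle(ExclusivePrefetchCounter.class, "value", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private long nearby;          // written only to pull the line into the exclusive state
    private volatile long value;  // the actual counter

    long getAndIncrement() {
        long expect;
        do {
            nearby = 0;      // plain write: requests the cache line in exclusive state
            expect = value;  // volatile read
        } while (!VALUE.compareAndSet(this, expect, expect + 1));
        return expect;
    }
}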

If compute() is a long operation, an idea is to use transactional
memory.  If there is a conflict, the transaction aborts and the hardware
thread starts over immediately.  This way the hardware thread does not
waste time finishing the execution of compute() when the store to
m_value will definitely fail.

beginTransaction();
m_value = compute(m_value);
commitTransaction();

-Nathan


------------------------------

Message: 5
Date: Mon, 28 Jan 2019 16:48:22 +0000
From: Gil Tene <[hidden email]>
To: Alex Otenko <[hidden email]>
Cc: Valentin Kovalenko <[hidden email]>,
        concurrency-interest <[hidden email]>
Subject: Re: [concurrency-interest] getAndIncrement with backoff

The key difference is that, on architectures that support atomic increment or atomic add instructions, getAndIncrement() can (and should) be implemented as a wait-free construct. For various concurrent mechanisms that would use this construct, this makes the difference between being wait free and lock free. And that difference is generally much more important and valuable than the (potentially non-forward-progress-guaranteeing-for-each-thread) throughput benefit of the primitive itself under highly contended situations.

Note that e.g. x86’s lock xadd is wait free in implementation. But so is the LOCK CAS instruction, and so are LL and SC instructions on any processor that supports them. AFAIK all user mode instructions are wait free, by definition, and must complete within a small and reasonable bound as a fundamental requirement of proper hardware design (this quality is easy to deduce from basic expectations we have of all hardware with user mode instructions: e.g. just think of the obvious system-hanging user-mode DoS attack if any non-wait-free user mode instructions actually existed).

Any looping or redo logic is always a pure software implementation detail. Whether or not a given construct can be implemented to be wait free depends on the wait-free primitives it has at its disposal (which in turn depend on the capabilities of hardware instructions available).

For the basic atomic constructs (like the various atomic operations offered by the JDK for e.g. AtomicLong) the choice should always be to implement the construct in a wait-free manner where possible. Wherever we don’t, we hobble the ability of concurrent mechanisms written using those constructs to be wait-free.
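
To make the wait-free/lock-free distinction concrete in JDK terms, here is an illustrative sketch (on x86, HotSpot intrinsifies the first variant to lock xadd, as discussed above, while the second spins):

import java.util.concurrent.atomic.AtomicLong;

final class Tickets {
    private final AtomicLong next = new AtomicLong();

    // Wait-free where getAndIncrement maps to an atomic fetch-and-add
    // instruction: each call completes in a bounded number of steps,
    // regardless of what other threads do.
    long takeWaitFree() {
        return next.getAndIncrement();
    }

    // Only lock-free: some thread always makes progress, but any given
    // thread can lose the CAS race indefinitely under contention.
    long takeLockFree() {
        long t;
        do {
            t = next.get();
        } while (!next.compareAndSet(t, t + 1));
        return t;
    }
}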


On Jan 28, 2019, at 1:25 AM, Alex Otenko via Concurrency-interest <[hidden email]<mailto:[hidden email]>> wrote:

Last time I heard this discussed the conclusion was that there isn’t a universal backoff strategy that suits all platforms and cases.

For example, the logic of backoff used in this test case is good for continuous atomic incrementing. Will the CPUs really be spinning in a trivial increment loop? More likely you’ll slam the hot spot with multiple CPUs at once, then the threads go away to do things, then come back. Backing off the way it’s done in the test is not necessarily going to help - the win comes from one thread being able to do several increments while the others do not contend to do the same; but if the threads are not attempting to do several increments, backoff is going to work like the usual CAS loop.

lock:xadd defers this to the hardware. The instruction has no “redo” - but of course internally at hardware level it probably is still a CAS-like loop with a retry - and a backoff (I think all consensus protocols have corner cases that can only be resolved through timing a race).

Alex

On 28 Jan 2019, at 08:49, Valentin Kovalenko <[hidden email]<mailto:[hidden email]>> wrote:

> The reason for not using it for backoff is because there is no backoff in the intrinsic

So lock:xadd does not have backoff logic; might it then be a good idea not to use this intrinsic and instead use a "manual" getAndIncrement implementation with backoff in the JDK, since it results in significantly higher throughput? In the mentioned article, the authors tested not only multiple architectures with many CPU threads (not 4 like I do), but also CAS-based data structures, and they achieved throughput improvements with backoff in all tested cases.

> The reason for getting different results is possibly sharing the cache: backoff allows one CPU to usurp the cache line and make progress; without the backoff the line is always shared between all the CPUs.

Sure, since threads retry CAS less often once they encounter failures, this effectively reduces contention.

Valentin

------------------------------

End of Concurrency-interest Digest, Vol 167, Issue 12
*****************************************************


Re: getAndIncrement with backoff

JSR166 Concurrency mailing list
@Valentin
> Interestingly, atomics documentation (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/atomic/package-summary.html) only guarantees lock freedom, and does not even mention wait freedom

It makes sense, considering that not all architectures provide a wait-free instruction for it.

