Re: Concurrency-interest Digest, Vol 167, Issue 10

getAndIncrement with backoff

JSR166 Concurrency mailing list
> there isn’t a universal backoff strategy that suits all platforms and cases

It does not seem a problem to have different intrinsic backoff strategies
for different platforms, but this, of course, can't be done for different
cases. However, since backoff only comes into play after a CAS failure
(or even multiple failures), it should affect neither throughput nor
latency in a low-contention scenario. But Francesco provided an
explanation of how this can still hurt performance (though I,
unfortunately, didn't understand it). Francesco, may I ask you to explain
the same thing for a less educated audience? :)

> lock:xadd defers this to the hardware. The instruction has no “redo” - but of course internally at hardware level it probably is still a CAS-like loop with a retry - and a backoff (I think all consensus protocols have corner cases that can only be resolved through timing a race).

So basically this is the same idea we usually apply in programming: rely
on the underlying layer's implementation when one is provided. And here
the underlying layer is the hardware.
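
For illustration, a minimal sketch of the contrast (assuming an AtomicLong
counter; on x86 the JIT compiles getAndIncrement into lock:xadd, whereas
the manual loop must retry after a CAS failure, which is the only place a
backoff could go):

import java.util.concurrent.atomic.AtomicLong;

class CounterSketch
{
   final AtomicLong counter = new AtomicLong();

   // Delegates to the hardware: on x86 the JIT emits lock:xadd for this,
   // and the instruction cannot fail.
   long incrementViaIntrinsic()
   {
      return counter.getAndIncrement();
   }

   // Manual CAS loop: retries on failure; a backoff would slot in after
   // a failed compareAndSet.
   long incrementViaCas()
   {
      long current;
      do
      {
         current = counter.get();
      }
      while (!counter.compareAndSet(current, current + 1));
      return current;
   }
}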

Valentin


Re: getAndIncrement with backoff

JSR166 Concurrency mailing list
Well, you can take a look at the locks used for synchronized. The logic there aims at precisely the same questions: should it bias towards one thread, should it be a spin-lock, should it inflate to a "fat lock" (an OS-provided primitive)? But each lock behaves differently, so the JVM makes different decisions for each individual lock.

The atomic increment intrinsics would suffer from the same problem: each hot spot would require different treatment (but then you also have to come up with, and justify the cost of, added tracking of contention for individual atomic longs).
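
To make that cost concrete, here is a hedged sketch of what per-instance contention tracking might look like; the class and field names are invented for illustration, and this is not how the JDK implements anything:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical: an atomic long that counts its own CAS failures so that a
// runtime could, in principle, pick a backoff strategy per instance. Every
// instance pays for the extra field, and every failure pays for the extra
// bookkeeping.
class TrackedAtomicLong
{
   private final AtomicLong value = new AtomicLong();
   private final AtomicLong casFailures = new AtomicLong(); // added tracking state

   long getAndIncrement()
   {
      for (;;)
      {
         long current = value.get();
         if (value.compareAndSet(current, current + 1))
            return current;
         casFailures.incrementAndGet(); // bookkeeping on every failed CAS
         // a strategy chooser would consult casFailures here
      }
   }
}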

Alex


Re: getAndIncrement with backoff

JSR166 Concurrency mailing list

Here is a clarification about lock:xadd on Intel x86 platforms.  The hardware does *not* execute the operation in a CAS-like loop with a retry.  The hardware thread retrieves the cache line in the exclusive state, loads the value, does the add, and stores the result back to the cache line.  While the operation is in progress, the core will not allow the cache line to be modified by another hardware thread.  Hence, there is no chance of failure or a retry.

Here is a typical CAS loop I write in software...

// m_value is an AtomicLong-style variable; compute() derives the new
// value from the old one.
long expect, update;
do
{
   expect = m_value.get();
   update = compute(expect);
}
while (!m_value.compareAndSet(expect, update));

The hardware thread will retrieve the cache line in the shared state during m_value.get().  The cache line is then vulnerable to being modified by another hardware thread during compute(expect).  The hardware thread then executes compareAndSet().  It changes the cache line to the exclusive state, loads the value, compares the value to expect and, if they are equal, stores update.  While the CAS instruction is executing, no other hardware thread can modify the cache line, just as with lock:xadd.  A backoff can supposedly help by reducing the number of times compute(expect) is executed.
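
For concreteness, a backoff variant of that loop might look like the following sketch (the spin counts are arbitrary, and Thread.onSpinWait(), available since Java 9, is only a scheduling hint):

int failures = 0;
for (;;)
{
   expect = m_value.get();
   update = compute(expect);
   if (m_value.compareAndSet(expect, update))
      break;
   // Back off before retrying: spin exponentially longer after each
   // failure, capped so the wait does not grow without bound.
   failures = Math.min(failures + 1, 10);
   for (int i = 0; i < (1 << failures); i++)
      Thread.onSpinWait();
}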

There is another concern.  m_value.get() retrieves the cache line in the shared state, which costs a bus operation, and then compareAndSet() has to change the cache line to the exclusive state, which costs another, more expensive bus operation.  Think about this code.

do
{
   // Write a plain field first; the assumption is that m_nearby shares
   // m_value's cache line, so the write pulls the line in exclusive state.
   m_nearby = 0;
   expect   = m_value.get();
   update   = compute(expect);
}
while (!m_value.compareAndSet(expect, update));

The idea is to retrieve the cache line in the exclusive state to begin with, by writing to the cache line at the beginning of the loop (i.e., m_nearby = 0).  This reduces the number of bus operations.  I have never tried this, but it might perform better.

If compute() is a long operation, one idea is to use transactional memory.  If there is a conflict, the transaction aborts and the hardware thread starts over immediately.  This way the hardware thread does not waste time finishing the execution of compute() when the store to m_value is certain to fail.

// Pseudocode: beginTransaction()/commitTransaction() stand in for a
// hardware transactional memory facility (e.g. Intel TSX); Java has no
// public API for this.
beginTransaction();

m_value = compute(m_value);

commitTransaction();
-Nathan


Re: getAndIncrement with backoff

JSR166 Concurrency mailing list
So, I would like to sum up all the ideas. Please correct or complete me if I missed something from the discussion.

The JDK does not use backoff or any other logic that might sometimes increase the throughput of CAS-based operations (e.g. getAndIncrement) because:
  1. Some hardware provides wait-free implementations, which means that once we saturate all CPU threads, the throughput of CAS will not degrade as the number of software threads grows (Gil, http://cs.oswego.edu/pipermail/concurrency-interest/2019-January/016784.html; Nathan and Ila, http://cs.oswego.edu/pipermail/concurrency-interest/2019-January/016783.html)
  2. In many real cases, backoff or other logic reducing CAS contention will likely have no positive effect unless threads do almost nothing but CAS-ing, and the overhead of trying to pick the best strategy for each situation is not justifiable (Alex, http://cs.oswego.edu/pipermail/concurrency-interest/2019-January/016769.html, http://cs.oswego.edu/pipermail/concurrency-interest/2019-January/016780.html)
  3. Backoff logic introduces more opportunities for a thread to wait at a safepoint poll, thus potentially drastically increasing max latencies (Francesco, http://cs.oswego.edu/pipermail/concurrency-interest/2019-January/016779.html)

Valentin


Re: getAndIncrement with backoff

JSR166 Concurrency mailing list
Sorry, the point in item 1 about throughput is true for both lock-free and wait-free implementations; the difference is in the max latencies, which are bounded for a wait-free implementation.


Re: Concurrency-interest Digest, Vol 167, Issue 10

JSR166 Concurrency mailing list
On 1/28/19 11:43 AM, Valentin Kovalenko via Concurrency-interest wrote:
>> there isn’t a universal backoff strategy that suits all platforms and cases

Here's one result:

Benchmark                                                       Mode  Cnt         Score          Error  Units
AtomicApiComparisonTest.atomicLongGetAndIncrement               avgt    5      1123.350 ±     1121.332  ns/op
AtomicApiComparisonTest.atomicLongGetAndIncrementManual         avgt    5       461.725 ±      370.705  ns/op
AtomicApiComparisonTest.atomicLongGetAndIncrementManualBackoff  avgt    5  27382411.415 ± 96039209.596  ns/op

If I had to guess, it'd be that the threads are backing off from each other in perfect
synchronization, and the CAS never completes.
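
If that is what is happening, the usual fix is to randomize the backoff so
that threads don't retry in lockstep. A sketch of that idea (the names and
spin bounds here are invented; this is not the benchmark's actual code):

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

class JitterBackoff
{
   // Randomized exponential backoff: each thread waits a different, random
   // number of spins, so competing threads are unlikely to stay in sync.
   static long getAndIncrement(AtomicLong value)
   {
      int failures = 0;
      for (;;)
      {
         long current = value.get();
         if (value.compareAndSet(current, current + 1))
            return current;
         failures = Math.min(failures + 1, 10);
         int spins = ThreadLocalRandom.current().nextInt(1 << failures);
         for (int i = 0; i < spins; i++)
            Thread.onSpinWait();
      }
   }
}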

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671