The very best CAS loop


Martin Buchholz
Discussion on CAS loops got me looking again at our own:

    public final V getAndUpdate(UnaryOperator<V> updateFunction) {
        V prev = get(), next = null;
        for (boolean haveNext = false;;) {
            if (!haveNext)
                next = updateFunction.apply(prev);
            if (weakCompareAndSetVolatile(prev, next))
                return prev;
            haveNext = (prev == (prev = get()));
        }
    }

The haveNext boolean and useless initialization of next bother me.  We can do better!

    public final V getAndUpdate(UnaryOperator<V> updateFunction) {
        for (V prev = get();;) {
            V next = updateFunction.apply(prev);
            do {
                if (weakCompareAndSetVolatile(prev, next))
                    return prev;
            } while (prev == (prev = get()));
        }
    }

even though it probably saves more bytecodes than cycles.
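
(For readers coming to this cold: this is the getAndUpdate of the java.util.concurrent.atomic classes, and a typical call site looks something like the sketch below; the class and names here are only illustrative. The update function may be re-applied under contention, so it should be side-effect-free.)

    import java.math.BigInteger;
    import java.util.concurrent.atomic.AtomicReference;

    class GetAndUpdateExample {
        static final AtomicReference<BigInteger> LAST =
                new AtomicReference<>(BigInteger.ZERO);

        // Returns the previous value; the lambda may be re-invoked if the CAS
        // loses a race, so it must be a pure function of its argument.
        static BigInteger bump() {
            return LAST.getAndUpdate(x -> x.add(BigInteger.ONE));
        }
    }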


Re: The very best CAS loop

Dávid Karnok
Isn't the second one getting two safepoint polls when the weak CAS fails because the value actually changed?

--
Best regards,
David Karnok


Re: The very best CAS loop

Martin Buchholz

On Sat, Sep 24, 2016 at 1:34 PM, Dávid Karnok <[hidden email]> wrote:
Isn't the second one getting two safepoint polls when the weak CAS fails because the value actually changed?

I hope not; the trailing bytecodes are

      33: lcmp
      34: ifeq          14
      37: goto          5

 I expect a safepoint on a backward jump, and only one is taken, so one safepoint per failed CAS?


Re: The very best CAS loop

Romain Colle

Out of curiosity, why are we using a weak CAS instead of a regular one in this method?

Thanks,
Romain


Re: The very best CAS loop

Andrew Haley
On 25/09/16 07:45, Romain Colle wrote:
> Out of curiosity, why are we using a weak CAS instead of a regular one in this method?

It's an efficiency concern.  On machines which support weak CAS, a CAS
can fail because of contention (or a spurious failure; these can happen
for any reason) or because the variable actually changed.  On such
machines, a strong CAS is usually implemented as a loop which spins.
In that case it makes more sense to fetch the variable again and retry
the CAS.

All of this is perhaps easier to see with an assembly-language
example, which I can provide on request.  :-)

Andrew.



Re: The very best CAS loop

Aleksey Shipilev
In reply to this post by Martin Buchholz
I don't quite get why we need to retry on the same "prev" to begin
with: we are allowed to call the function several times on the same value.
If so, there is a more canonical form:

     public final V getAndUpdate(UnaryOperator<V> updateFunction) {
         V prev, next;
         do {
             prev = get();
             next = updateFunction.apply(prev);
         } while (!weakCompareAndSetVolatile(prev, next));
         return prev;
     }

The CAS-vs-CAE example shows us that additional branches in the hot CAS loop
would be detrimental to performance under contention (and with a quick
update function here, probably). Accurate benchmarking would tell the
story; we cannot do this by looking at the code.

Also, retrying on (spurious) failure when prev had not changed negates
the reason to use weakCAS: now you have just constructed the retry loop
that strong CAS does with LL/SC, and we have tried to avoid that with
weakCAS!

Thanks,
-Aleksey



Re: The very best CAS loop

Martin Buchholz
Hi Aleksey,

My goals were to 
- have the shortest code path if the CAS succeeds the first time, which is hopefully the common case
- never call the update function multiple times in the case of spurious failure, only for real contention.  We don't know how expensive the update function is, and the number of times it is called is user-detectable.
- yes, I have sort-of reconstructed the retry loop that strong CAS does with LL/SC, BUT we can save the value obtained from get(), which does not happen in the emulation below.  We don't have multiple value return in Java!

    public final V getAndUpdate(UnaryOperator<V> updateFunction) {
        for (;;) {
            V prev = get();
            V next = updateFunction.apply(prev);
            if (strongCas(prev, next))
                return prev;
        }
    }
    /** Implement strong CAS using weak CAS. */
    boolean strongCas(V expectedValue, V newValue) {
        while (get() == expectedValue) {
            if (weakCompareAndSetVolatile(expectedValue, newValue))
                return true;
        }
        return false;
    }



Re: The very best CAS loop

Andrew Haley
On 26/09/16 18:01, Martin Buchholz wrote:
> My goals were to
> - have the shortest code path if the CAS succeeds the first time, which is
> hopefully the common case
> - never call the update function multiple times in the case of spurious
> failure, only for real contention.  We don't know how expensive the update
> function is, and the number of times it is called is user-detectable.

That's probably suboptimal.  The ratio of truly spurious failures
(caused by e.g. pre-emption) to actual contention is probably pretty
small, so if an SC fails it's probably true contention, and the value
in memory really has changed.  In that case Aleksey's version is
better.

It's impossible to tell for sure because it depends on your actual
contention rate in your application.

Andrew.

Re: The very best CAS loop

Paul Sandoz
In reply to this post by Martin Buchholz

If you are gonna do that, why not just call “compareAndSet" instead of emulating it with your own “strongCas” method? And then, I presume, we are back to where we started.

I don't think the “updateFunction” can tell a spurious failure from another failure, given that the variable, under contention, can change away from and back to the previous value within the “strongCas” call.
 
I prefer Aleksey’s version.

Paul.

Re: The very best CAS loop

Martin Buchholz


On Mon, Sep 26, 2016 at 10:36 AM, Paul Sandoz <[hidden email]> wrote:


> If you are gonna do that, why not just call “compareAndSet" instead of emulating it with your own “strongCas” method? And then, I presume, we are back to where we started.


That code was just demonstrating how the return value from get() was getting lost, whereas if you inline a weak CAS loop as I have done, you save the extra call to get().
 
> I don't think the “updateFunction” can tell a spurious failure from another failure, given that the variable, under contention, can change away from and back to the previous value within the “strongCas” call.

I'm imagining user code that counts all the calls to getAndUpdate, and all the calls to updateFunction.  If they differ, I would like that to be evidence of true contention.   True, we tell users updateFunction is supposed to be side-effect-free.  OTOH, we say that attempted updates may fail due to "contention among threads", suggesting we don't call updateFunction spuriously.
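
(For concreteness, a sketch of the kind of counting user code described above; all names here are illustrative, and the update function is deliberately given a side effect purely for instrumentation.)

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.concurrent.atomic.LongAdder;

    class ContentionProbe {
        final AtomicReference<Integer> ref = new AtomicReference<>(0);
        final LongAdder updates = new LongAdder();   // calls to getAndUpdate
        final LongAdder applies = new LongAdder();   // calls to the update function

        Integer increment() {
            updates.increment();
            return ref.getAndUpdate(x -> { applies.increment(); return x + 1; });
        }

        // With a loop that re-applies the function only when the observed value
        // changed, applies.sum() exceeding updates.sum() would be evidence of
        // true contention rather than purely spurious weak-CAS failures.
        long extraApplies() {
            return applies.sum() - updates.sum();
        }
    }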


Re: The very best CAS loop

Peter Levart
Hi,

I don't know about you, but on my i7 PC, I get consistently better results using strong CAS as opposed to weak CAS for the following benchmark:

http://cr.openjdk.java.net/~plevart/misc/GetAndUpdate/GetAndUpdateBench.java

The results:

Benchmark                                      Mode  Cnt   Score   Error  Units
GetAndUpdateBench.dflt                         avgt   20  53.947 ± 0.770  ns/op
GetAndUpdateBench.dflt:getAndUpdate1_dflt      avgt   20  54.103 ± 1.083  ns/op
GetAndUpdateBench.dflt:getAndUpdate2_dflt      avgt   20  53.791 ± 1.014  ns/op
GetAndUpdateBench.martin                       avgt   20  52.724 ± 0.218  ns/op
GetAndUpdateBench.martin:getAndUpdate1_martin  avgt   20  52.379 ± 0.972  ns/op
GetAndUpdateBench.martin:getAndUpdate2_martin  avgt   20  53.070 ± 1.005  ns/op
GetAndUpdateBench.shade                        avgt   20  53.437 ± 0.624  ns/op
GetAndUpdateBench.shade:getAndUpdate1_shade    avgt   20  53.401 ± 0.615  ns/op
GetAndUpdateBench.shade:getAndUpdate2_shade    avgt   20  53.474 ± 0.652  ns/op
GetAndUpdateBench.strong                       avgt   20  38.090 ± 0.727  ns/op
GetAndUpdateBench.strong:getAndUpdate1_strong  avgt   20  37.784 ± 1.633  ns/op
GetAndUpdateBench.strong:getAndUpdate2_strong  avgt   20  38.397 ± 1.691  ns/op


The benchmark exhibits 2 threads that don't have data contention (each of them is modifying its own variable), but most probably those threads do modify the same cache line. The update function does have a non-zero cost.
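
(Peter's actual benchmark is at the URL above; what follows is only a rough JMH sketch of the shape he describes, i.e. two grouped threads each updating its own variable with a non-trivial update function. It uses AtomicInteger and invented names, so treat it as an illustration rather than a reproduction.)

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;
    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Group)
    public class FalseSharingSketch {
        // Adjacent fields of one object: no data contention between the two
        // threads, but the counters will often land on the same cache line.
        final AtomicInteger value1 = new AtomicInteger();
        final AtomicInteger value2 = new AtomicInteger();

        // A deliberately non-zero-cost update function.
        static int update(int x) {
            return x * 1664525 + 1013904223;
        }

        @Benchmark @Group("pair") @GroupThreads(1)
        public int getAndUpdate1() {
            return value1.getAndUpdate(FalseSharingSketch::update);
        }

        @Benchmark @Group("pair") @GroupThreads(1)
        public int getAndUpdate2() {
            return value2.getAndUpdate(FalseSharingSketch::update);
        }
    }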

To complement the above results, I also ran the same benchmark on a Raspberry Pi 2 with an Oracle build of JDK 9 (b137):

Benchmark                                      Mode  Cnt     Score   Error  Units
GetAndUpdateBench.dflt                         avgt   20  2347.811 ± 2.492  ns/op
GetAndUpdateBench.dflt:getAndUpdate1_dflt      avgt   20  2347.328 ± 6.697  ns/op
GetAndUpdateBench.dflt:getAndUpdate2_dflt      avgt   20  2348.294 ± 7.052  ns/op
GetAndUpdateBench.martin                       avgt   20  2342.773 ± 5.618  ns/op
GetAndUpdateBench.martin:getAndUpdate1_martin  avgt   20  2342.398 ± 8.662  ns/op
GetAndUpdateBench.martin:getAndUpdate2_martin  avgt   20  2343.148 ± 9.514  ns/op
GetAndUpdateBench.shade                        avgt   20  1844.426 ± 4.304  ns/op
GetAndUpdateBench.shade:getAndUpdate1_shade    avgt   20  1841.747 ± 5.287  ns/op
GetAndUpdateBench.shade:getAndUpdate2_shade    avgt   20  1847.105 ± 6.416  ns/op
GetAndUpdateBench.strong                       avgt   20  1846.413 ± 1.155  ns/op
GetAndUpdateBench.strong:getAndUpdate1_strong  avgt   20  1846.312 ± 8.095  ns/op
GetAndUpdateBench.strong:getAndUpdate2_strong  avgt   20  1846.513 ± 8.110  ns/op


Here we may be observing a strange phenomenon. Re-executing the non-zero-cost update function on spurious failure (shade) surprisingly gives better throughput than immediately retrying the weak CAS (dflt, martin).

Also, wasn't strong CAS on ARM supposed to be mapped to a weak CAS loop? Why the difference between dflt/martin and strong then?

You've got something to study now... ;-)

Regards, Peter




Re: The very best CAS loop

Paul Sandoz

> On 26 Sep 2016, at 16:32, Peter Levart <[hidden email]> wrote:
>
> I don't know about you, but on my i7 PC, I get consistently better results using strong CAS as opposed to weak CAS for the following benchmark:

For C2 on x86, the weak CAS intrinsic wires up in the same manner as the strong CAS intrinsic. IIRC for C1 on x86, weak CAS is not intrinsic.

On platforms/compilers that do not support the weak CAS intrinsic, the Unsafe weak CAS implementation defers to the Unsafe strong CAS.
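
(Roughly, and with an illustrative method rather than actual JDK source, that fallback amounts to:)

    // Sketch only: where no weak-CAS intrinsic is available, the weak flavour
    // can simply delegate to the strong CAS, which is why the two can be
    // indistinguishable on such platforms.
    public final boolean weakCompareAndSetVolatile(V expectedValue, V newValue) {
        return compareAndSet(expectedValue, newValue);
    }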

I would be inclined to switch off tiered compilation and use parallel GC (not G1) and obtain results with a number of forks to see if that makes a difference. I am speculating; I really need to play with the benchmark and look at the generated code/perfasm output.

I believe a Raspberry Pi 2 is based on an ARM Cortex-A7 processor, which is 32-bit, and I am unsure of the state of affairs w.r.t. HotSpot on that platform.

Paul.


Re: The very best CAS loop

Martin Buchholz
In reply to this post by Peter Levart
Weak CAS should never be slower than strong CAS.  There's still time to fix it before JDK 9 ships!

Measuring performance in the presence of heavy contention is troublesome.  Optimizing code can make such microbenchmarks slower by increasing contention.  Will this benchmark get better if we change the implementation to grab a global lock and starve the other thread?
"""A key weakness of the CAS operation, known to both researchers and practitioners of concurrent programming, is its performance in the presence of memory contention. When multiple threads concurrently attempt to apply CAS operations to the same shared variable, typically at most a single thread will succeed in changing the shared variable’s value and the CAS operations of all other threads will fail. Moreover, significant degradation in performance occurs when variables manipulated by CAS become contention “hot spots”: as failed CAS operations generate coherence traffic on most architectures, they congest the interconnect and memory devices and slow down successful CAS operations,"""


Re: The very best CAS loop

Peter Levart
In reply to this post by Paul Sandoz
Hi,

Ah, the pitfalls of benchmarking...

Even though I ran the benchmarks several times and got consistent results, it seems that the consistency was accidental. As I said: "The benchmark exhibits 2 threads that don't have data contention (each of them is modifying its own variable), but most probably those threads do modify the same cache line."

"most probably" == "not necessarily"!!! Murphy interfering with "strong" benchmark it was. The Vars object was consistently allocated at an address so that value1 and value2 were not in the same cache line only in "strong" benchmark. Corrected benchmark that guarantees same cache-line for two int values in an aligned direct byte buffer:

    http://cr.openjdk.java.net/~plevart/misc/GetAndUpdate/GetAndUpdateBench2.java

...shows more uniform results on x86_64:

Benchmark                                       Mode  Cnt   Score   Error  Units
GetAndUpdateBench2.dflt                         avgt   20  56.683 ± 0.539  ns/op
GetAndUpdateBench2.dflt:getAndUpdate1_dflt      avgt   20  56.703 ± 0.640  ns/op
GetAndUpdateBench2.dflt:getAndUpdate2_dflt      avgt   20  56.662 ± 0.519  ns/op
GetAndUpdateBench2.martin                       avgt   20  56.343 ± 0.496  ns/op
GetAndUpdateBench2.martin:getAndUpdate1_martin  avgt   20  56.382 ± 0.486  ns/op
GetAndUpdateBench2.martin:getAndUpdate2_martin  avgt   20  56.305 ± 0.520  ns/op
GetAndUpdateBench2.shade                        avgt   20  55.447 ± 0.398  ns/op
GetAndUpdateBench2.shade:getAndUpdate1_shade    avgt   20  55.408 ± 0.526  ns/op
GetAndUpdateBench2.shade:getAndUpdate2_shade    avgt   20  55.486 ± 0.506  ns/op
GetAndUpdateBench2.strong                       avgt   20  55.689 ± 0.617  ns/op
GetAndUpdateBench2.strong:getAndUpdate1_strong  avgt   20  55.612 ± 0.601  ns/op
GetAndUpdateBench2.strong:getAndUpdate2_strong  avgt   20  55.766 ± 0.701  ns/op


And on Raspberry Pi 2 (ARMv7) too:

Benchmark                                       Mode  Cnt     Score    Error  Units
GetAndUpdateBench2.dflt                         avgt   20  1953.743 ±  3.430  ns/op
GetAndUpdateBench2.dflt:getAndUpdate1_dflt      avgt   20  1953.332 ±  8.565  ns/op
GetAndUpdateBench2.dflt:getAndUpdate2_dflt      avgt   20  1954.154 ±  8.334  ns/op
GetAndUpdateBench2.martin                       avgt   20  1955.770 ±  6.025  ns/op
GetAndUpdateBench2.martin:getAndUpdate1_martin  avgt   20  1956.913 ± 10.290  ns/op
GetAndUpdateBench2.martin:getAndUpdate2_martin  avgt   20  1954.626 ±  6.957  ns/op
GetAndUpdateBench2.shade                        avgt   20  1959.441 ±  2.337  ns/op
GetAndUpdateBench2.shade:getAndUpdate1_shade    avgt   20  1961.199 ±  6.367  ns/op
GetAndUpdateBench2.shade:getAndUpdate2_shade    avgt   20  1957.682 ±  5.713  ns/op
GetAndUpdateBench2.strong                       avgt   20  1960.668 ±  8.764  ns/op
GetAndUpdateBench2.strong:getAndUpdate1_strong  avgt   20  1959.773 ± 10.804  ns/op
GetAndUpdateBench2.strong:getAndUpdate2_strong  avgt   20  1961.564 ±  8.086  ns/op
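
(Not Peter's actual GetAndUpdateBench2, just a sketch of the cache-line-pinning idea it relies on: align a direct buffer to an assumed 64-byte line and access two int slots in it through a byte-buffer view VarHandle. The line size and every name below are assumptions for illustration.)

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.util.function.IntUnaryOperator;

    class SameCacheLineInts {
        static final int LINE = 64;   // assumed cache-line size
        static final VarHandle INT = MethodHandles.byteBufferViewVarHandle(
                int[].class, ByteOrder.nativeOrder());   // coordinates: (ByteBuffer, byteIndex)

        // alignedSlice (JDK 9+) returns a slice whose first byte is LINE-aligned,
        // so byte offsets 0 and 4 of the slice fall in one line on a machine
        // with 64-byte cache lines.
        final ByteBuffer line = ByteBuffer.allocateDirect(2 * LINE).alignedSlice(LINE);
        final int off1 = 0, off2 = 4;   // two int slots sharing that line

        int getAndUpdate(int off, IntUnaryOperator f) {
            int prev, next;
            do {
                prev = (int) INT.getVolatile(line, off);
                next = f.applyAsInt(prev);
            } while (!INT.compareAndSet(line, off, prev, next));
            return prev;
        }
    }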


Regards, Peter



Re: The very best CAS loop

Peter Levart

What I was hoping to measure with the benchmark is some difference between using weak CAS and strong CAS in a getAndUpdate retry loop that always invokes the update function (even on spurious failures of weak CAS). I can understand that this is not possible on x86_64 as both are mapped to the same strong CAS intrinsic. But I also haven't observed any meaningful difference on arm32. So is it possible that Oracle's JDK 9 b137 for arm32 also maps both weak and strong CAS to the same implementation which is a strong CAS simulation? Is there any platform where weak CAS is actually a weak CAS currently?


Regards, Peter



Re: The very best CAS loop

Andrew Haley
On 27/09/16 11:44, Peter Levart wrote:

> What I was hoping to measure with the benchmark is some difference
> between using weak CAS and strong CAS in a getAndUpdate retry loop
> that always invokes the update function (even on spurious failures
> of weak CAS).

That's going to be very difficult.  In practice truly spurious
failures are rare enough that you won't see them unless you have false
sharing.  (In which case, of course, strong CAS slows down too.)  You
benefit from using weak CAS at times of high contention by not
spinning to retry an SC which is going to fail anyway.

> I can understand that this is not possible on x86_64 as both are
> mapped to the same strong CAS intrinsic. But I also haven't observed
> any meaningful difference on arm32. So is it possible that Oracle's
> JDK 9 b137 for arm32 also maps both weak and strong CAS to the same
> implementation which is a strong CAS simulation?

Certainly.  The only way to be enlightened is to look at the generated
code.

> Is there any platform where weak CAS is actually a weak CAS
> currently?

AArch64.

Andrew.

Re: The very best CAS loop

Peter Levart

Hi Andrew,


On 09/27/2016 02:56 PM, Andrew Haley wrote:
> That's going to be very difficult.  In practice truly spurious
> failures are rare enough that you won't see them unless you have false
> sharing.

And that's exactly what my benchmark tries to provoke. It uses CAS to concurrently update two distinct locations from two distinct threads (no data contention) within a common cache line. I was trying to measure the worst-case impact of re-invoking the update function on spurious failure of weak CAS vs. using strong CAS, which would never fail in my benchmark (as it is strong), as a function of the update function's execution time. But it appears there is no difference on Oracle's arm32 build no matter how long the update function executes.


>> Is there any platform where weak CAS is actually a weak CAS
>> currently?
>
> AArch64.

Ah. I don't have any device for that (yet) ;-)

Thanks, Peter


Re: The very best CAS loop

Andrew Haley
On 27/09/16 15:39, Peter Levart wrote:
> Ah. I don't have any device for that (yet) ;-)

Well, I can do it if you tell me exactly what to do.


Re: The very best CAS loop

Martin Buchholz
In reply to this post by Martin Buchholz
My proposal produces strictly cleaner, more compact source code and bytecode, while maintaining the property that updateFunction will never be called just because of a spurious weak CAS failure, without loss of performance, so ... COMMIT IS YES?

(ambitious Dice-style CAS optimizations are a future project)

(I would even be OK with *specifying* that updateFunction will never be called spuriously, but that's also a future project)

(Paul/Doug: is there a VarHandle equivalent to Atomic*.getAndUpdate?  Another future project?)
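
(Assuming there is no such method on VarHandle itself, which Martin's question suggests, here is a hedged sketch of what a user-level helper might look like, using the same loop shape as the proposal above. Names are illustrative, and weakCompareAndSet is assumed to be the volatile-semantics flavour under the final VarHandle naming.)

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;
    import java.util.function.UnaryOperator;

    class Holder {
        volatile String value = "";

        static final VarHandle VALUE;
        static {
            try {
                VALUE = MethodHandles.lookup()
                        .findVarHandle(Holder.class, "value", String.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        // Same shape as the proposed getAndUpdate, expressed against a VarHandle.
        // Note the prev == ... comparison is reference identity, as in the original.
        String getAndUpdate(UnaryOperator<String> updateFunction) {
            for (String prev = (String) VALUE.getVolatile(this);;) {
                String next = updateFunction.apply(prev);
                do {
                    if (VALUE.weakCompareAndSet(this, prev, next))
                        return prev;
                } while (prev == (prev = (String) VALUE.getVolatile(this)));
            }
        }
    }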


Re: The very best CAS loop

Andrew Haley
On 27/09/16 19:18, Martin Buchholz wrote:

> My proposal produces strictly cleaner, more compact source code and
> bytecode, while maintaining the property that updateFunction will
> never be called just because of a spurious weak CAS failure, without
> loss of performance, so ... COMMIT IS YES?

I don't think so, for reasons discussed.

Andrew.