AtomicReferenceArray.get() and intrinsics method inlining

Hi group,

For the past few weeks we have been trying to diagnose a caching performance issue. Profiling with YourKit made it evident that the caching layer (EhCache 2.10.x) was the culprit. This caching library uses a fork of the pre-Java 8 ConcurrentHashMapV8 behind the scenes.

The ConcurrentHashMapV8 fork has been customized for internal EhCache use; however, looking at the source, it does not differ much from the original ConcurrentHashMapV8. Sources for EhCache 2.10.5 can be downloaded from https://www.ehcache.org/downloads/
The fork is net.sf.ehcache.util.concurrent.ConcurrentHashMap

Further profiling revealed that the hotspot was not within the ConcurrentHashMapV8 fork itself, but rather in j.u.c.atomic.AtomicReferenceArray.get().
Apparently, the EhCache team decided at some point to replace Unsafe.getObjectVolatile() with AtomicReferenceArray, presumably to avoid compiler warnings about the unsupported use of Unsafe (this was well before Java 9, the module system, Unsafe deprecation, etc.)

The logic seems straightforward and there are no obvious errors in their use of AtomicReferenceArray; the switch from Unsafe.getObjectVolatile() to AtomicReferenceArray.get() is functionally fine.
However, we performed the following micro-benchmark: we inserted a single entry [key = Integer.valueOf(1); value = new Object()] and then performed map.get() a billion times for the same key.

The ConcurrentHashMapV8 took about 40 seconds on average to complete. However, stock Java 8 ConcurrentHashMap took only about 800 ms to complete the same task. Profiling showed that the ConcurrentHashMapV8 fork hotspot was at AtomicReferenceArray.get(), which matched the issue we found in our production systems with very "hot" cache keys.
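The map-level loop was essentially the following (a minimal sketch, not the exact harness; the fork can be exercised the same way by swapping the import to net.sf.ehcache.util.concurrent.ConcurrentHashMap):

import java.util.concurrent.ConcurrentHashMap;

public class MapGetTest {
    public static void main(String[] args) {
        // single hot entry, as described above
        ConcurrentHashMap<Integer, Object> map = new ConcurrentHashMap<Integer, Object>();
        Integer key = Integer.valueOf(1);
        map.put(key, new Object());

        long start = System.nanoTime();
        for (long i = 0; i < 1_000_000_000L; i++) {
            map.get(key);
        }
        System.out.println("map.get() took " + (System.nanoTime() - start) / (1000 * 1000) + " ms");
    }
}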

We produced a simple test class to exclude everything EhCache-related, and we were able to reproduce the same behaviour with direct AtomicReferenceArray usage. This benchmark does not attempt to measure anything useful in the real world, but it is simple enough that the performance difference is easily reproducible:

///////////////////////////////////////////////////////////////////////////////////////////////////////////////
import java.lang.reflect.Field;
import java.util.concurrent.atomic.AtomicReferenceArray;

import sun.misc.Unsafe;

public class Test {
    private static final Unsafe UNSAFE;
    private static final int KEY = 1;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (Exception e) {
            throw new Error(e);
        }
    }

    public static void main(String[] args) {
        final long repetitions = 1_000_000_000L;
        System.out.println("ara=" + ara(repetitions) + " ms");
    }

    private static final long ara(final long repetitions) {
        final AtomicReferenceArray<Object> ara = new AtomicReferenceArray<Object>(100);
        ara.set(KEY, new Object());
        long start = System.nanoTime();
        for (long i = 0; i < repetitions; i++) {
            ara.get(KEY); // volatile read of the same hot element
        }
        long end = System.nanoTime();
        return (end - start) / (1000 * 1000); // elapsed milliseconds
    }
}
///////////////////////////////////////////////////////////////////////////////////////////////////////////////

This test took about 40 seconds on the exact same hardware as the previous CHM tests. So we concluded that the use of AtomicReferenceArray.get() instead of Unsafe.getObjectVolatile() was the reason the ConcurrentHashMapV8 EhCache fork was so much slower than the stock Java 8 ConcurrentHashMap.

The difference is so huge that we suspected this could be the result of compiler optimizations producing very different compiled code, and possibly of method inlining not working as expected with Unsafe intrinsics. This was dead on: we enabled the following JVM diagnostic options:
-server -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:-TieredCompilation -XX:MaxInlineLevel=15

And the test yielded the following on standard output (only the relevant part is reproduced below):

 com.renxo.cms.Test::ara @ 32 (65 bytes)
                            @ 34   java.util.concurrent.atomic.AtomicReferenceArray::get (10 bytes)   inline (hot)
                              @ 3   java.util.concurrent.atomic.AtomicReferenceArray::checkedByteOffset (45 bytes)   inline (hot)
                                @ 41   java.util.concurrent.atomic.AtomicReferenceArray::byteOffset (12 bytes)   inline (hot)
                              @ 6   java.util.concurrent.atomic.AtomicReferenceArray::getRaw (12 bytes)   inline (hot)
                                @ 8   sun.misc.Unsafe::getObjectVolatile (0 bytes)   failed to inline (intrinsic)
                                @ 8   sun.misc.Unsafe::getObjectVolatile (0 bytes)   native method

So this is the interesting bit:
sun.misc.Unsafe::getObjectVolatile (0 bytes)   failed to inline (intrinsic)

Running a similar benchmark against the stock Java 8 ConcurrentHashMap.get() showed that sun.misc.Unsafe::getObjectVolatile was being inlined just fine. So something in AtomicReferenceArray.get() is preventing getObjectVolatile() from being inlined. Since this is an intrinsic, the performance difference is expected to be huge.

After carefully studying the stock Java 8 ConcurrentHashMap.get(), we found that the reason that method was being successfully inlined is the (tab = table) != null check performed before tabAt() is invoked. Apparently, the HotSpot compiler is unable to inline getObjectVolatile() unless it can verify that its main argument will always be non-null.
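For reference, the relevant shape of the stock Java 8 code is roughly the following (paraphrased from the OpenJDK 8 sources, not copied verbatim; Node, U, ASHIFT, ABASE and spread() are the fields and helpers defined elsewhere in that class):

    // The explicit (tab = table) != null check lets the JIT prove that the
    // array passed to getObjectVolatile() is non-null at the tabAt() call site.
    static final <K,V> Node<K,V> tabAt(Node<K,V>[] tab, int i) {
        return (Node<K,V>) U.getObjectVolatile(tab, ((long) i << ASHIFT) + ABASE);
    }

    public V get(Object key) {
        Node<K,V>[] tab; Node<K,V> e; int n;
        int h = spread(key.hashCode());
        if ((tab = table) != null && (n = tab.length) > 0 &&
            (e = tabAt(tab, (n - 1) & h)) != null) {
            // ... walk the bin starting at e and return the matching value ...
        }
        return null;
    }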

So, we created an AtomicReferenceArray fork with an alternative getRaw() implementation:

    private E getRaw(long offset) {
        Object[] array = this.array;
        // this should never be null in practice; the explicit check lets HotSpot
        // prove the argument below is non-null, which allows the intrinsic to inline
        if (array == null) {
            throw new IllegalStateException();
        }
        return (E) unsafe.getObjectVolatile(array, offset);
    }

This resulted in the test taking less than 500 ms instead of 40 seconds (that's an 80-fold difference!). And the compiler was now successfully inlining:

com.renxo.cms.Test::ara @ 32 (65 bytes)
                            @ 34   com.renxo.cms.AtomicReferenceArray::get (10 bytes)   inline (hot)
                              @ 3   com.renxo.cms.AtomicReferenceArray::checkedByteOffset (42 bytes)   inline (hot)
                                @ 38   com.renxo.cms.AtomicReferenceArray::byteOffset (12 bytes)   inline (hot)
                              @ 6   com.renxo.cms.AtomicReferenceArray::getRaw (26 bytes)   inline (hot)
                                @ 22   sun.misc.Unsafe::getObjectVolatile (0 bytes)   (intrinsic)

We were extremely surprised that a core library class such as AtomicReferenceArray could benefit so much from such a small change. Our particular test is probably not representative of real-world workloads, but still: is there any reason why this fix should not be introduced in OpenJDK? We see that the current trunk AtomicReferenceArray no longer uses Unsafe.getObjectVolatile() and instead uses VarHandles. We are still on Java 8, so we did not test this on Java 9 or later, but it is possible that the same issue is still there.
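For reference, recent OpenJDK versions implement the volatile element read roughly as follows (a paraphrased sketch of the VarHandle-based shape; the wrapper class name below is just for illustration):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Paraphrased shape of the current AtomicReferenceArray.get(): an
// array-element VarHandle replaces Unsafe.getObjectVolatile().
class VarHandleArray<E> {
    private static final VarHandle AA =
            MethodHandles.arrayElementVarHandle(Object[].class);
    private final Object[] array;

    VarHandleArray(int length) {
        array = new Object[length];
    }

    @SuppressWarnings("unchecked")
    public final E get(int i) {
        return (E) AA.getVolatile(array, i); // volatile read, bounds-checked by the VarHandle
    }
}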

Also, we focused only on AtomicReferenceArray.get(). It is likely that other methods using Unsafe have similar inlining issues, and other j.u.c.atomic classes might exhibit similar behaviour.

Concurrency experts, we'd appreciate your feedback on our findings.
Thanks,

Manuel Dominguez Sarmiento


Re: AtomicReferenceArray.get() and intrinsics method inlining

On 1/16/20 8:43 PM, Manuel Dominguez Sarmiento via Concurrency-interest wrote:

>     private static final long ara(final long repetitions) {
>         final AtomicReferenceArray<Object> ara = new AtomicReferenceArray<Object>(100);
>         ara.set(KEY, new Object());
>         long start = System.nanoTime();
>         for (long i = 0; i < repetitions; i++) {
>             ara.get(KEY);
>         }
>         long end = System.nanoTime();
>         return (end - start) / (1000 * 1000);
>     }
> }

For the love of all that is holy, use JMH:
  https://openjdk.java.net/projects/code-tools/jmh/
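For illustration, a minimal JMH version of the loop above might look like this (a sketch; the benchmark class and method names are illustrative):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReferenceArray;

import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class AraGetBenchmark {
    private AtomicReferenceArray<Object> ara;

    @Setup
    public void setup() {
        ara = new AtomicReferenceArray<>(100);
        ara.set(1, new Object());
    }

    @Benchmark
    public Object get() {
        // returning the value keeps dead-code elimination from removing the read
        return ara.get(1);
    }
}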

> This test took about 40 seconds on the exact same hardware as previous CHM tests. So we concluded
> that AtomicReferenceArray.get() usage instead of Unsafe.getObjectVolatile() was the cause behind the
> ConcurrentHashMapV8 EhCache fork being so much slower than Java8 stock ConcurrentHashMap.

> So this is the interesting bit:
> sun.misc.Unsafe::getObjectVolatile (0 bytes)   failed to inline (intrinsic)

Cannot reproduce:

$ java -server -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:-TieredCompilation -XX:MaxInlineLevel=15 Test
java.util.concurrent.atomic.AtomicReferenceArray::byteOffset (12 bytes)

   @ 41   java.util.concurrent.atomic.AtomicReferenceArray::byteOffset (12 bytes)   inline (hot)

java.util.concurrent.atomic.AtomicReferenceArray::getRaw (12 bytes)

   @ 8   sun.misc.Unsafe::getObjectVolatile (0 bytes)   (intrinsic)

   @ 3   java.util.concurrent.atomic.AtomicReferenceArray::checkedByteOffset (45 bytes)   inline (hot)
     @ 41   java.util.concurrent.atomic.AtomicReferenceArray::byteOffset (12 bytes)   inline (hot)

   @ 6   java.util.concurrent.atomic.AtomicReferenceArray::getRaw (12 bytes)   inline (hot)

     @ 8   sun.misc.Unsafe::getObjectVolatile (0 bytes)   (intrinsic)

Test::ara @ 29 (65 bytes)

   @ 38   java.util.concurrent.atomic.AtomicReferenceArray::get (10 bytes)   inline (hot)

     @ 3   java.util.concurrent.atomic.AtomicReferenceArray::checkedByteOffset (45 bytes)   inline (hot)
       @ 41   java.util.concurrent.atomic.AtomicReferenceArray::byteOffset (12 bytes)   inline (hot)

     @ 6   java.util.concurrent.atomic.AtomicReferenceArray::getRaw (12 bytes)   inline (hot)

       @ 8   sun.misc.Unsafe::getObjectVolatile (0 bytes)   (intrinsic)

Test::ara @ -2 (65 bytes)   made not entrant

ara=429 ms


$ java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.232-b09, mixed mode)


> After careful studying of stock Java8 ConcurrentHashMap.get(), we found that the reason why that
>  method was being successfully inlined is the (tab = table) != null check before tabAt() is
> invoked. Apparently, the HotSpot compiler is unable to inline getObjectVolatile() unless it can
> verify that its main argument will always be non-null.
If true, that would qualify as a performance bug. HotSpot should be able to inline Unsafe accessors,
emitting runtime null checks when it cannot prove non-nullity.

Please try with more up to date JDK binary.

> The ConcurrentHashMapV8 took about 40 seconds on average to complete. However, stock Java 8
> ConcurrentHashMap took only about 800 ms to complete the same task. Profiling showed that the
> ConcurrentHashMapV8 fork hotspot was at AtomicReferenceArray.get(), which matched the issue we
> found in our production systems with very "hot" cache keys.
The more likely causes would be:
 a) an additional memory dereference every time "table" is accessed. In that frankenstein-monster of
CHM it would be an additional dereference to reach the backing array inside the ARA itself;
 b) profiling skew that misattributed the bottleneck to ARA.get();
 c) some subtle difference between the fork and the "stock" CHM version.

The trouble here is that your minimized test does not show the problem :/ Please provide an MCVE for
the actual problem you are chasing (pull the exact CHM sources into it, if you have to), plus full
details on the environment you run the test in.

--
Thanks,
-Aleksey

Re: AtomicReferenceArray.get() and intrinsics method inlining

Thanks Aleksey for your feedback. We know about JMH but have no experience with it. We'll definitely use it in the future.
We used Oracle JDK 1.8.0_212 on Mac OS X to produce the reported results. Update 212 is from April 2019 so it's not that old anyway.
We'll re-test on the latest 1.8.x JDK and report back.

Manuel Dominguez Sarmiento

Re: AtomicReferenceArray.get() and intrinsics method inlining

On 1/16/20 9:28 PM, Manuel Dominguez Sarmiento wrote:
> We used Oracle JDK 1.8.0_212 on Mac OS X to produce the reported results. Update 212 is from April
> 2019 so it's not that old anyway.

Wait, now *that* sounds familiar.

Plus the original observation:

> After careful studying of stock Java8 ConcurrentHashMap.get(), we found that the reason why that
>  method was being successfully inlined is the (tab = table) != null check before tabAt() is
> invoked. Apparently, the HotSpot compiler is unable to inline getObjectVolatile() unless it can
> verify that its main argument will always be non-null.
Suggests this:
  https://bugs.openjdk.java.net/browse/JDK-8221355

You should really try up-to-date JDK.

--
Thanks,
-Aleksey


Re: AtomicReferenceArray.get() and intrinsics method inlining

I have not investigated Ehcache 2.x for a while, but it used to use SelectableConcurrentHashMap [1], which is a fork of Java 5's map. In that fork, the lock-free reads are replaced by a per-segment read lock.

At the time of 2.10.4, this was the default implementation when creating a cache in a benchmark [2] as,
    CacheConfiguration config = new CacheConfiguration("benchmark", maximumSize);
    config.setMemoryStoreEvictionPolicyFromObject(evictionPolicy);
    cache = new Cache(config);

That JMH benchmark ran with a Zipfian distribution (so hot keys are accessed more frequently) and I observed ~20M gets per second at 16 cores, using the LRU policy.

The benchmark might be a good starting point. I had to remove v2 when it became incompatible with v3 due to [3].

This was applicable around the 2014-15 timeframe and I have not looked at it since, so beware of my possible misunderstandings.


Re: AtomicReferenceArray.get() and intrinsics method inlining

Repeated the test on 8u241 ... the issue is GONE.
Also, testing the forked ConcurrentHashMapV8, performance is now on par with stock Java 8 ConcurrentHashMap.

Seems JDK-8221355 was the culprit.
Thanks Aleksey!

Re: AtomicReferenceArray.get() and intrinsics method inlining

Hi Ben, thanks to your comment we found an issue in some critical interning- and reflection-related classes in our codebase: they were using net.sf.ehcache.util.concurrent.ConcurrentHashMap instead of java.util.concurrent.ConcurrentHashMap.

It seems Eclipse's auto-import feature picked the "wrong" ConcurrentHashMap ... that's why we were seeing net.sf.ehcache.util.concurrent.ConcurrentHashMap in our profiler even though it is not the default map EhCache uses for the actual caches.

Lessons learned: always use an up-to-date JDK and never trust Eclipse's auto-import feature.

Re: AtomicReferenceArray.get() and intrinsics method inlining

@manuel and attempt to use JMH too! :P

Re: AtomicReferenceArray.get() and intrinsics method inlining

Sure :-)

On 16/01/2020 18:52, Francesco Nigro wrote:
@manuel and attempt to use JMH too! :P
