Thoughts about LongAdder

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Thoughts about LongAdder

Nathan Reynolds-2
By way of introduction, I work at Oracle in the PSR team (Performance, Scalability and Reliability).  This team is focuses on optimizing the applications and middleware.  My task is to optimize Weblogic server, the JVM, kernel and network for Oracle's Exalogic, a rack of Westmere-EP servers.  Most of my focus is on Weblogic server and some what on the other areas as needed.

Monday, I came up with a "new" data structure, ConcurrentCounter, to take advantage of a processor id API I proposed to the Oracle HotSpot team.  Alan Bateman sent me a link to an email in this list to LongAdder.  I reviewed the LongAdder implementation and learned a bit from it.  I enhanced ConcurrentCounter based on some of the ideas in LongAdder.  Here's a list of ideas of where LongAdder and ConcurrentCounter differ.  Hopefully, this list will help improve the final result .
  1. I would recommend striping the counter *not* by thread but by processor/core.  This will result in almost 0 contention and significantly reduce cache coherence traffic.  If striped by thread, then the counter will probably consume more memory, the thread to Cell mapping may not stabilize and when a thread migrates to another processor, the cache lines have to be transferred.  If striped by processor, then thread migration has no impact (unless it was in the middle of updating a Cell).  Because of this, striping by processor will outperform striping by thread.  (More on this in another email.)  Also, the number of Cells required will be fixed to the number of processors which is almost always fewer than the number of threads in server processes.  The attached ConcurrentCounter stripes by processor.
  2. LongAdder pads each Cell to avoid false sharing.  Good idea.  However, the amount of memory required can be quite costly in terms of capacity, wasting cache space and bandwidth.  ConcurrentCounter (attached) assumes the JVM allocator is NUMA aware.  Thus, each "Cell" will be allocated in memory local to the processor.  False sharing shouldn't be possible and if it happens then the contention will only happen among sibling cores.  If GC is NUMA aware, then each "Cell" will remain in local memory.  Again, false sharing isn't likely to be a problem.  Thus, padding is hopefully not needed.  I am interested in results showing padding vs no padding... assuming all of the JVM requirements are met.
  3. reset() should replace the array of Cell instead of setting the values to 0.  This will increase GC and allocator load but reset() can then be atomic.  ConcurrentCounter.clear() implements this logic quite simply.
  4. getAndReset() should also replace the array of Cell.  The get() can be computed on the old array of Cell and should be much more accurate since the concurrent updates should start using the new array right away.  The inaccuracies will come from threads that context switched off while in the middle of an update and then resume later using the old array.  ConcurrentCounter.getAndClear() does this.
  5. ConcurrentCounter supports a set() API.  It basically does the reset() trick I mentioned above but the new set of "Cells" is already assigned the given value.
  6. ConcurrentCounter executes only 1 atomic instruction per update (in the main path) and never uses non-blocking locks.
  7. retryUpdate() is extremely complicated code (due to thread striping and balancing).  ConcurrentCounter doesn't have such complexity.
Nathan Reynolds | Consulting Member of Technical Staff | 602.333.9091
Oracle PSR Engineering | Server Technology

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest

ConcurrentCounter.java (9K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Thoughts about LongAdder

Doug Lea
Sorry for the delay on this!

On 08/10/11 12:37, Nathan Reynolds wrote:
> Here's a list of ideas of where
> LongAdder and ConcurrentCounter differ.
>
>    1. I would recommend striping the counter *not* by thread but by
>       processor/core. This will result in almost 0 contention and significantly
>       reduce cache coherence traffic.

This would be a good idea if we had a ProcessorLocal class.
Which is probably a good idea. Here's a minimal sketch
Add
   class java.lang.Processor {
      int getId();
      int getMaxId();
      static int proximity(Processor a, Processor b); // metrics TBD
      // non-public ProcessorLocal table
   }
Add to class java.lang.Thread:
       Processor getProcessor();
Add
    class java.lang.ProcessorLocal {
      // mostly like ThreadLocal
   }

However, in the mean time, the internal randomization done
inside LongAdder gives performance that seems to closely approximate
what you might get by doing this.


>    2. LongAdder pads each Cell to avoid false sharing. Good idea. However, the
>       amount of memory required can be quite costly in terms of capacity,
>       wasting cache space and bandwidth. ConcurrentCounter (attached) assumes
>       the JVM allocator is NUMA aware.

I agree that padding is a bit hacky, but I don't know how a JVM
can know to do this without either the assertion of or evidence
of a variable being contended. The best suggestion I've heard on
this is to add an annotation @Contended.


>       I am interested in results showing padding vs no padding...
>       assuming all of the JVM requirements are met.

I see more than a factor of 6 slowdown on Intel i7 and AMD 6172 test machines.

>    3. reset() should replace the array of Cell instead of setting the values to
>       0. This will increase GC and allocator load but reset() can then be
>       atomic.

Thanks for the suggestion, but I don't know of use cases for which
reset(), but not sum(), need to be atomic -- the only use cases I
know are to reset when it is otherwise known that threads are quiescent.

Doing this would also disable the nice property of current LongAdder
that it allocates no extra space until contended but instead uses
"base" field.

>    5. ConcurrentCounter supports a set() API. It basically does the reset()
>       trick I mentioned above but the new set of "Cells" is already assigned the
>       given value.

Ditto.


>    7. retryUpdate() is extremely complicated code (due to thread striping and
>       balancing).


We do the complicated stuff so you don't have to  :-)

-Doug
_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: Thoughts about LongAdder

Nathan Reynolds-2
On 8/18/2011 1:41 PM, Doug Lea wrote:
Sorry for the delay on this!

On 08/10/11 12:37, Nathan Reynolds wrote:
Here's a list of ideas of where
LongAdder and ConcurrentCounter differ.

   1. I would recommend striping the counter *not* by thread but by
      processor/core. This will result in almost 0 contention and significantly
      reduce cache coherence traffic.

This would be a good idea if we had a ProcessorLocal class.
Which is probably a good idea. Here's a minimal sketch
Add
  class java.lang.Processor {
     int getId();
     int getMaxId();
     static int proximity(Processor a, Processor b); // metrics TBD
     // non-public ProcessorLocal table
  }
Add to class java.lang.Thread:
      Processor getProcessor();
Add
   class java.lang.ProcessorLocal {
     // mostly like ThreadLocal
  }
In another email I sent on 8/10/2011 at 3:50 pm, it has my attempt at ProcessorLocal and ProcessorIndex.  The latter class maps raw CPU/OS processor ID into a 0 to N-1 index.  It doesn't provide a Processor class or a proximity() metric.
However, in the mean time, the internal randomization done
inside LongAdder gives performance that seems to closely approximate
what you might get by doing this.


   2. LongAdder pads each Cell to avoid false sharing. Good idea. However, the
      amount of memory required can be quite costly in terms of capacity,
      wasting cache space and bandwidth. ConcurrentCounter (attached) assumes
      the JVM allocator is NUMA aware.

I agree that padding is a bit hacky, but I don't know how a JVM
can know to do this without either the assertion of or evidence
of a variable being contended. The best suggestion I've heard on
this is to add an annotation @Contended.


      I am interested in results showing padding vs no padding...
      assuming all of the JVM requirements are met.

I see more than a factor of 6 slowdown on Intel i7 and AMD 6172 test machines.

   3. reset() should replace the array of Cell instead of setting the values to
      0. This will increase GC and allocator load but reset() can then be
      atomic.

Thanks for the suggestion, but I don't know of use cases for which
reset(), but not sum(), need to be atomic -- the only use cases I
know are to reset when it is otherwise known that threads are quiescent.

Thanks for pointing that out.  I can't think of a use case either.  My implementation provided a simple way to make them atomic and so I did.

Doing this would also disable the nice property of current LongAdder
that it allocates no extra space until contended but instead uses
"base" field.

   5. ConcurrentCounter supports a set() API. It basically does the reset()
      trick I mentioned above but the new set of "Cells" is already assigned the
      given value.

Ditto.


   7. retryUpdate() is extremely complicated code (due to thread striping and
      balancing).


We do the complicated stuff so you don't have to  :-)


Oops.  I now realize what I said is a bit rude, but I didn't intend it as such.  Thank you for not taking offense.  What I mean to say is that ProcessorLocal will make LongAdder much simpler to implement.  But, you already understood that.

-Doug
_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest

Nathan Reynolds | Consulting Member of Technical Staff | 602.333.9091
Oracle PSR Engineering | Server Technology


_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest