Overhead of ThreadLocal data


Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
This sounds VERY like the behaviour we observed - the WeakReferences not
getting cleared would make total sense for causing the linear scans.

On 10/19/18 8:09 AM, Gil Tene via Concurrency-interest wrote:

> The problem with “short lived” thread locals that do not explicitly use remove() when they are no longer needed is that in those situations “short lived” means “until the next GC detects that the ThreadLocal is no longer reachable, and the weak reference to that ThreadLocal starts returning nulls”. Scanning for stale entries more frequently, or on an API call, won’t help these situations because the map entries are not “stale” until the collector determines the unreachability of their related ThreadLocal instances. When short lived ThreadLocals are created and die at a rate linear in application throughput, this leads to ThreadLocal maps in active threads holding thousands or millions of entries, to extremely long collision chains, and to situations where half the CPU is spent walking those chains in ThreadLocal.get().
>
> Where would such a ThreadLocal instance creation rate come from? We’ve seen it happen in the wild in quite a few places. I initially thought of it as an application or library-level misuse of a ThreadLocal, but with each instance of juc ReentrantReadWriteLock potentially using a ThreadLocal instance for internal coordination, application and library writers that use ReentrantReadWriteLock in a seemingly idiomatic way are often unaware of the ThreadLocal implications. A simple, seemingly valid pattern for using ReentrantReadWriteLock would be a system where arriving work “packets” are handled by parallel threads (doing parallel work within the work packet), where each work packet uses its own ReentrantReadWriteLock instance for coordination of the parallelism within the packet.
>
> With G1 (and C4, and likely Shenandoah and ZGC), with the current ThreadLocal implementation, the “until the next GC detects that the ThreadLocal is no longer reachable” meaning gets compounded by the fact that collisions in the map will actually prevent otherwise-unreachable ThreadLocal instances from becoming unreachable. As long as get() calls for other (chain colliding) ThreadLocals are actively performed on any thread, where “actively” means “more than once during a mark cycle”, the weakrefs from the colliding entries get strengthened during the mark, preventing the colliding ThreadLocal instances from dying in the given GC cycle. Stop-the-world newgen provides a bit of a filter for short-lived-enough ThreadLocals for G1, but for ThreadLocals with mid-term lifecycles (which will naturally occur as queues in a work system as described above grow under load), or without such a STW newgen (which none of the newer collectors have or want to have), this situation leads to explosive ThreadLocal map growth under high-enough throughput situations.
>
> We’ve created a tweaked implementation of ThreadLocal that avoids the weakref get()-strengthening problem with no semantic change, by having ThreadLocal get() determine identity by comparing a ThreadLocal ID (a unique long value per ThreadLocal instance, assigned at ThreadLocal instantiation) rather than comparing the outcome of a weakref get() for each entry in the chain walked when looking for a match. This makes weakref get()s only occur on the actual entry you are looking up, and not on any colliding entries, thus preventing the gets from keeping dead ThreadLocals alive. We’ve been using this implementation successfully with C4 in Zing, and would be happy to share and upstream if there is interest.
>
> Sent from my iPad
>
>> On Oct 19, 2018, at 6:45 AM, Nathan and Ila Reynolds via Concurrency-interest <[hidden email]> wrote:
>>
>> A while ago, I raised the request for a processor local. Striping a highly contended cache line reduces contention, but the cache lines still bounce around cores.  Using a processor local, each cache line can be assigned to a core and it remains in the exclusive state for that core.  Atomic operations on that cache line are the fastest possible.
>>
>> My original example was a reader-writer lock which is almost exclusively used for read operations.  However, with a processor local, we can implement uncontended counters for heavily used queues.
>>
>> -Nathan
>>
>>> On 10/19/2018 7:06 AM, Brian Goetz wrote:
>>> The problem is deeper than that; people use TL because it's what we've got, when really what people want (in various situations) is { processor, frame, task } locals, and TL is the best approximation we have.  Over in Project Loom, there's a deeper exploration going on of the use cases, and what additional mechanisms might help.
>>>
>>>> On 10/19/2018 6:12 AM, Peter Levart via Concurrency-interest wrote:
>>>> It is interesting that people have problems with short-lived ThreadLocals, because ThreadLocal tries hard to expunge stale entries gradually during use. Even get() tries to do it if the access happens to miss the home slot. Repeated access to the same entry should not have to skip stale entries over and over again. There seem to be special use pattern(s) that provoke problems. Studying them more deeply could show us the weakness of the current ThreadLocal auto-expunge strategy so it could be improved and a new method would not be needed...
>>>>
>>>> Maybe the problem is not caused by unexpunged stale entries. It may be related to degeneration of the hash table that leads to sub-optimal placement of live entries. In that case, perhaps triggering a re-hash automatically when get() encounters many live entries it has to skip would help.
>>>>
>>>> So those with problems, please be more specific about them. Can you show us a reproducer?
>>>>
>>>> Regards, Peter
>>>>
>>>>> On 10/17/2018 08:26 PM, Doug Lea via Concurrency-interest wrote:
>>>>> [+list]
>>>>>
>>>>>> On 10/17/18 11:44 AM, Nathan and Ila Reynolds wrote:
>>>>>> Can we add the following method to ThreadLocal?
>>>>>>
>>>>>> public static void expungeStaleEntries()
>>>>> This seems like a reasonable request (although perhaps with an improved
>>>>> name).  The functionality exists internally, and it seems overly
>>>>> parental not to export it for use as a band-aid by those people who have
>>>>> tried and otherwise failed to solve the zillions of short-lived
>>>>> ThreadLocals in long-lived threads problem.
>>>>>
>>>>> Can anyone think of a reason not to do this?
>>>>>
>>>>> -Doug
>>>>>
>>>>>> This method will call ThreadLocal.ThreadLocalMap.expungeStaleEntries()
>>>>>> for the ThreadLocalMap of the current thread.  Thread pools can then
>>>>>> call this method when the thread finishes processing a job after GC.
>>>>>> This solves the problem of zillions of short-lived ThreadLocals in
>>>>>> long-lived threads.
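
As a reading aid, here is a minimal, hedged sketch of the per-packet ReentrantReadWriteLock pattern Gil describes above. The class and pool sizes are invented for illustration; the point is only that every packet allocates its own lock, each ReentrantReadWriteLock creates a ThreadLocal internally (in OpenJDK) for read-hold counting, and so ThreadLocal instances are created and abandoned at the packet arrival rate while the worker threads and their ThreadLocalMaps live on.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical reproducer sketch, not production code.
public class PerPacketLockDemo {

    // One arriving work "packet", with its own lock for intra-packet parallelism.
    static final class WorkPacket {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        volatile long state;

        long readState() {
            lock.readLock().lock();   // read locking may populate the lock's per-thread hold counter
            try {
                return state;         // stand-in for real read-mostly work within the packet
            } finally {
                lock.readLock().unlock();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int packetNo = 0; packetNo < 100_000; packetNo++) {
            WorkPacket packet = new WorkPacket();       // short-lived lock, hence a short-lived ThreadLocal
            for (int i = 0; i < 4; i++) {
                pool.execute(() -> packet.readState()); // parallel work within the packet
            }
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }
}

If the expungeStaleEntries() method proposed earlier in the thread existed, a pool could call it between jobs as a band-aid; as it stands, the worker threads cannot remove() entries for ThreadLocals they never see directly, so they must wait for GC plus the map's incremental expunging.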

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
I've posted a full OpenJDK8-based ThreadLocal.java implementation for reference here: https://github.com/giltene/GilExamples/blob/master/ModifiedThreadLocal/src/main/java/java/lang/ThreadLocal.java , and as noted, we'd be happy to upstream it (into 12?).
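
The linked file is the reference; purely as a reading aid, the core idea can be sketched roughly as below (invented names, not Gil's actual code): each ThreadLocal gets a unique long id at construction, map entries carry that id alongside the weak reference, and the probe loop compares ids, so WeakReference.get() is never called on colliding entries.

import java.lang.ref.WeakReference;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only.
final class IdLookupSketch {

    static final AtomicLong NEXT_ID = new AtomicLong();

    static class IdThreadLocal<T> {
        final long id = NEXT_ID.getAndIncrement();   // assigned once, at instantiation
    }

    static class Entry extends WeakReference<IdThreadLocal<?>> {
        final long id;      // survives even after the referent has been collected
        Object value;
        Entry(IdThreadLocal<?> key, Object value) {
            super(key);
            this.id = key.id;
            this.value = value;
        }
    }

    Entry[] table = new Entry[16];   // power-of-two sized, linear probing

    Object get(IdThreadLocal<?> key) {
        int i = slotFor(key.id);
        while (table[i] != null) {
            if (table[i].id == key.id) {
                return table[i].value;            // only the matching entry is touched
            }
            i = (i + 1) & (table.length - 1);     // colliding entries are skipped by id,
        }                                         // never dereferenced via WeakReference.get()
        return null;
    }

    int slotFor(long id) {
        return (int) id & (table.length - 1);     // placeholder; see Doug's long
    }                                             // Fibonacci-hash suggestion below
}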

On Oct 19, 2018, at 9:45 AM, Gil Tene via Concurrency-interest <[hidden email]> wrote:



Sent from Gil's iPhone

On Oct 19, 2018, at 9:06 AM, Peter Levart <[hidden email]> wrote:

Hi Gil,

On 10/19/2018 05:09 PM, Gil Tene wrote:
We’ve created a tweaked implementation of ThreadLocal that avoids the weakref get()-strengthening problem with no semantic change, by having ThreadLocal get() determine identity by comparing a ThreadLocal ID (a unique long value per ThreadLocal instance, assigned at ThreadLocal instantiation) rather than comparing the outcome of a weakref get() for each entry in the chain walked when looking for a match. This makes weakref get()s only occur on the actual entry you are looking up, and not on any colliding entries, thus preventing the gets from keeping dead ThreadLocals alive. We’ve been using this implementation successfully with C4 in Zing, and would be happy to share and upstream if there is interest.

Very interesting. One question though: how does your implementation do expunging of stale entries? Just when there is a re-hash needed because of capacity growing over the threshold? I ask because in order to detect that an entry is stale, you have to call get(). The only other variant is to have the weakrefs enqueued and then poll the queue for them. But that already interacts with a reference-handling thread, and I had the impression that ThreadLocal is designed to have no synchronization whatsoever.

We use a reference queue, and poll on it to detect and deal with entries that (weakly) referred to dead ThreadLocals.


Regards, Peter
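
A rough sketch of the reference-queue approach described above (again with invented names, not the Zing code): each entry registers its weak reference with a per-map ReferenceQueue, and map operations drain the queue with poll(), which returns immediately rather than waiting, so no blocking interaction with the reference-handler thread is needed.

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

// Sketch only: the real ThreadLocalMap is an open-addressed table, not a HashMap;
// this just shows the poll-and-expunge idea.
final class QueueExpungingMapSketch<K, V> {

    private final ReferenceQueue<K> queue = new ReferenceQueue<>();
    private final Map<Long, Entry<K, V>> byId = new HashMap<>();

    static final class Entry<K, V> extends WeakReference<K> {
        final long id;
        V value;
        Entry(K key, long id, V value, ReferenceQueue<K> q) {
            super(key, q);      // GC enqueues this entry once the key is unreachable
            this.id = id;
            this.value = value;
        }
    }

    void put(K key, long id, V value) {
        drainStale();           // poll() never blocks, so this is cheap on the common path
        byId.put(id, new Entry<>(key, id, value, queue));
    }

    V get(long id) {
        Entry<K, V> e = byId.get(id);
        return e == null ? null : e.value;
    }

    // Remove entries whose keys the collector has already cleared.
    private void drainStale() {
        for (Object ref; (ref = queue.poll()) != null; ) {
            Entry<?, ?> dead = (Entry<?, ?>) ref;
            byId.remove(dead.id);
            dead.value = null;  // let the stale value become collectable as well
        }
    }
}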


Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
On 10/19/18 11:09 AM, Gil Tene wrote:

>
> With G1 (and C4, and likely Shenandoah and ZGC), with the current
> ThreadLocal implementation, the “until the next GC detects
> that the ThreadLocal is no longer reachable” meaning gets compounded
> by the fact that collisions in the map will actually prevent
> otherwise-unreachable ThreadLocal instances from becoming
> unreachable.

Thanks for pointing this out. Yes, this should be addressed (the
implementation long predated such collectors). There's a simpler way
than you showed though. We can extend the Fibonacci hash to "long" to
avoid ever wrapping, define a tombstone value, and use it as the id to
avoid WeakRef.get calls on collision. This reduces a bit of the time/space
overhead that this scheme imposes.

I'm also a little suspicious about whether using a reference queue is
worth the overhead. Because GCs clear weak refs in bunches, cleaning a
few usually leads to finding more without needing to poll. Any data
showing otherwise would be helpful.

I'm guessing that Peter Levart has had some similar thoughts; we'll work
off-list to see where these go.

-Doug
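
As a rough illustration of the long Fibonacci hash mentioned above (the constant and names here are only illustrative, and the tombstone handling is elided): 0x9E3779B97F4A7C15L is the 64-bit golden-ratio constant, the wider relative of the 0x61c88647 increment the current 32-bit implementation uses; multiplying sequentially assigned long ids by it and taking the high bits spreads successive ids across a power-of-two table without the id counter ever wrapping.

final class LongFibonacciHash {
    private static final long GOLDEN_GAMMA = 0x9E3779B97F4A7C15L;

    // Slot for a given id in a table of 2^tableBits entries (assumes 0 < tableBits <= 30).
    static int slot(long id, int tableBits) {
        return (int) ((id * GOLDEN_GAMMA) >>> (64 - tableBits));
    }
}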



Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
What about replacing ThreadLocal.ThreadLocalMap with WeakHashMap? 
WeakHashMap uses a ReferenceQueue to track stale entries and gets rid of
them on every access.  WeakHashMap checks the entry's hash before checking
the key for equality.  Using WeakHashMap eliminates the cost of maintaining
ThreadLocalMap, which is essentially a WeakHashMap.

-Nathan
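
As a small illustration of the WeakHashMap behaviour relied on above (not a proposal for how the JDK would actually wire this into Thread), entries for collected keys disappear the next time the map is touched:

import java.util.Map;
import java.util.WeakHashMap;

// Keys that become unreachable are dropped without any explicit remove() call.
// System.gc() is only a hint, so the printed size is not guaranteed to be 1 on
// every run.
public class WeakHashMapDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<Object, String> map = new WeakHashMap<>();

        Object longLived = new Object();
        map.put(longLived, "kept");             // key stays strongly reachable
        for (int i = 0; i < 10_000; i++) {
            map.put(new Object(), "stale");     // keys become unreachable immediately
        }

        System.gc();                            // hint; weak refs get cleared in bunches
        Thread.sleep(100);                      // give the reference handler a moment

        // size(), like most WeakHashMap operations, expunges stale entries first.
        System.out.println("size after GC: " + map.size());
        System.out.println("long-lived key still mapped: " + map.get(longLived));
    }
}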
