Overhead of ThreadLocal data

Overhead of ThreadLocal data

JSR166 Concurrency mailing list
Some of you might be interested to know the overhead of the "fast
path" of ThreadLocals. Some of you might be terrified of the VM
innards and panic when you see assembly code: this post is not for
you. Everybody else, read on...


Here's what happens when you say int n = ThreadLocal<Integer>::get :

         ││ ;; B11: # B24 B12 <- B3 B10 Loop: B11-B10 inner  Freq: 997.566
 10.51%  ││  0x000003ff68b38170: ldr x10, [xthread,#856]         ;*invokestatic currentThread {reexecute=0 rethrow=0 return_oop=0}
         ││                                                            ; - java.lang.ThreadLocal::get@0 (line 162)
         ││                                                            ; - org.sample.ThreadLocalTest::floss@14 (line 31)

## Read the pointer to java.lang.Thread from thread metadata.

         ││  0x000003ff68b38174: ldr w11, [x10,#76]
         ││  0x000003ff68b38178: lsl x24, x11, #3            ;*getfield threadLocals {reexecute=0 rethrow=0 return_oop=0}
         ││                                                            ; - java.lang.ThreadLocal::getMap@1 (line 254)
         ││                                                            ; - java.lang.ThreadLocal::get@6 (line 163)
         ││                                                            ; - org.sample.ThreadLocalTest::floss@14 (line 31)
  2.12%  ││  0x000003ff68b3817c: cbz x24, 0x000003ff68b3824c  ;*ifnull {reexecute=0 rethrow=0 return_oop=0}
         ││                                                            ; - java.lang.ThreadLocal::get@11 (line 164)
         ││                                                            ; - org.sample.ThreadLocalTest::floss@14 (line 31)

## Read the pointer to Thread.threadLocals from the Thread. Check it's
   not zero.

         ││ ;; B12: # B32 B13 <- B11  Freq: 997.551
         ││  0x000003ff68b38180: ldr w10, [x24,#20]
         ││  0x000003ff68b38184: lsl x10, x10, #3                ;*getfield table {reexecute=0 rethrow=0 return_oop=0}
         ││                                                            ; - java.lang.ThreadLocal$ThreadLocalMap::getEntry@5 (line 434)
         ││                                                            ; - java.lang.ThreadLocal::get@16 (line 165)
         ││                                                            ; - org.sample.ThreadLocalTest::floss@14 (line 31)

## Read the pointer to ThreadLocals.table from Thread.threadLocals


  2.12%  ││  0x000003ff68b38188: ldr w11, [x10,#12]              ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
         ││                                                            ; - java.lang.ThreadLocal$ThreadLocalMap::getEntry@8 (line 434)
         ││                                                            ; - java.lang.ThreadLocal::get@16 (line 165)
         ││                                                            ; - org.sample.ThreadLocalTest::floss@14 (line 31)
         ││                                                            ; implicit exception: dispatches to 0x000003ff68b38324

## Read the length field from table

         ││ ;; B13: # B26 B14 <- B12  Freq: 997.55
         ││  0x000003ff68b3818c: ldr w13, [x23,#12]

## Read ThreadLocal.threadLocalHashCode

         ││  0x000003ff68b38190: sub w12, w11, #0x1
         ││  0x000003ff68b38194: and w20, w13, w12               ;*iand {reexecute=0 rethrow=0 return_oop=0}
         ││                                                            ; - java.lang.ThreadLocal$ThreadLocalMap::getEntry@11 (line 434)
         ││                                                            ; - java.lang.ThreadLocal::get@16 (line 165)
         ││                                                            ; - org.sample.ThreadLocalTest::floss@14 (line 31)
  1.37%  ││  0x000003ff68b38198: add x12, x10, w20, sxtw #2

## int i = key.threadLocalHashCode & (table.length - 1);


         ││  0x000003ff68b3819c: cmp w11, #0x0
         ││  0x000003ff68b381a0: b.ls 0x000003ff68b38270

## make sure table.length is not <= 0. (Can't happen, but VM doesn't know that.)

         ││ ;; B14: # B20 B15 <- B13  Freq: 997.549
 11.88%  ││  0x000003ff68b381a4: ldr w10, [x12,#16]

## Entry e = table[i];

         ││  0x000003ff68b381a8: lsl x25, x10, #3          ;*aaload {reexecute=0 rethrow=0 return_oop=0}
         ││                                                            ; - java.lang.ThreadLocal$ThreadLocalMap::getEntry@18 (line 435)
         ││                                                            ; - java.lang.ThreadLocal::get@16 (line 165)
         ││                                                            ; - org.sample.ThreadLocalTest::floss@14 (line 31)
         ││  0x000003ff68b381ac: cbz x25, 0x000003ff68b38208  ;*ifnull {reexecute=0 rethrow=0 return_oop=0}
         ││                                                            ; - java.lang.ThreadLocal$ThreadLocalMap::getEntry@21 (line 436)
         ││                                                            ; - java.lang.ThreadLocal::get@16 (line 165)
         ││                                                            ; - org.sample.ThreadLocalTest::floss@14 (line 31)

## if (e != null

         ││ ;; B15: # B5 B16 <- B14  Freq: 997.489
  6.37%  ││  0x000003ff68b381b0: ldr w11, [x25,#12]
         ││  0x000003ff68b381b4: ldrsb w10, [xthread,#48]
         ││  0x000003ff68b381b8: lsl x26, x11, #3
  9.41%  ╰│  0x000003ff68b381bc: cbz w10, 0x000003ff68b38130

## G1 garbage collector special case for weak references: if we're
   doing parallel marking, take a slow path.

            ;; B5: # B27 B6 <- B18 B4 B16 B15  top-of-loop Freq: 997.489
 24.71%  ↗↗  0x000003ff68b38130: cmp x26, x23
         ││  0x000003ff68b38134: b.ne 0x000003ff68b38294          ;*invokevirtual getEntry {reexecute=0 rethrow=0 return_oop=0}
         ││                                                            ; - java.lang.ThreadLocal::get@16 (line 165)
         ││                                                            ; - org.sample.ThreadLocalTest::floss@14 (line 31)

##  && e.get() == key)

         ││ ;; B6: # B7 <- B5 B22  Freq: 997.549
         ││  0x000003ff68b38138: ldr w11, [x25,#28]
         ││  0x000003ff68b3813c: lsl x0, x11, #3                 ;*invokevirtual get {reexecute=0 rethrow=0 return_oop=0}
         ││                                                            ; - org.sample.ThreadLocalTest::floss@14 (line 31)

## We now have our ThreadLocal.

         ││ ;; B7: # B31 B8 <- B6 B25  Freq: 997.564
  1.71%  ││  0x000003ff68b38140: ldr w11, [x0,#8]                ; implicit exception: dispatches to 0x000003ff68b3830c
         ││ ;; B8: # B30 B9 <- B7  Freq: 997.563
         ││  0x000003ff68b38144: mov x12, #0x10000               // #65536
         ││                                                            ;   {metadata('java/lang/Integer')}
         ││  0x000003ff68b38148: movk x12, #0x3de8
         ││  0x000003ff68b3814c: cmp w11, w12
  3.15%  ││  0x000003ff68b38150: b.ne 0x000003ff68b382f4          ;*checkcast {reexecute=0 rethrow=0 return_oop=0}
         ││                                                            ; - org.sample.ThreadLocalTest::floss@17 (line 31)

## checkcast to make sure it really is an Integer.


         ││ ;; B9: # B28 B10 <- B8  Freq: 997.563
         ││  0x000003ff68b38154: ldr w10, [x0,#12]               ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
         ││                                                            ; - java.lang.Integer::intValue@1 (line 1132)
         ││                                                            ; - org.sample.ThreadLocalTest::floss@20 (line 31)

## Read the int field. We're done.

12 field loads, 5 conditional branches. That's the overhead of a
single ThreadLocal.get(). Conditional branches depending on the result
of a load from memory are expensive, and we have a lot of them.
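For orientation, here is the Java-level shape of the fast path the assembly above steps through — a paraphrase, not the exact OpenJDK source. The numbered comments map onto the annotated loads and branches:

```java
// Paraphrase of the ThreadLocal.get() fast path walked through above.
// Illustrative only; the class and field names are for this sketch.
class ThreadLocalFastPath {
    static final ThreadLocal<Integer> TL = ThreadLocal.withInitial(() -> 42);

    public static void main(String[] args) {
        // 1. load the Thread pointer from thread metadata (currentThread)
        // 2. load Thread.threadLocals, null check
        // 3. load map.table
        // 4. load table.length (with an implicit null check)
        // 5. i = key.threadLocalHashCode & (table.length - 1)
        // 6. load Entry e = table[i], null check
        // 7. load e.get() (a weak reference, behind a GC barrier)
        //    and compare it against the key
        // 8. checkcast to Integer, then load Integer.value
        int n = TL.get();
        System.out.println(n); // 42
    }
}
```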

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
That's very nice, and somewhat expected: I suppose that's why some (on this same list) have suggested extending Thread and providing an ad hoc field directly.
Many thanks for this analysis!

On Wed, 17 Oct 2018 at 13:58, Andrew Haley via Concurrency-interest <[hidden email]> wrote:
> Some of you might be interested to know the overhead of the "fast
> path" of ThreadLocals. [...]
>
> 12 field loads, 5 conditional branches. That's the overhead of a
> single ThreadLocal.get(). Conditional branches depending on the result
> of a load from memory are expensive, and we have a lot of them.


Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
On Wed, Oct 17, 2018 at 6:58 AM Andrew Haley via Concurrency-interest
<[hidden email]> wrote:
>
> Some of you might be interested to know the overhead of the "fast
> path" of ThreadLocals. Some of you might be terrified of the VM
> innards and panic when you see assembly code: this post is not for
> you. Everybody else, read on...
> [...]
> 12 field loads, 5 conditional branches. That's the overhead of a
> single ThreadLocal.get(). Conditional branches depending on the result
> of a load from memory are expensive, and we have a lot of them.

It sure would be nice to have static thread local fields, similarly to
C/C++ thread/_Thread_local, where there are just one or two pointer
indirections to get at the data even in dynamically-loaded class
loaders.

That said, I don't think there's any good way to implement
thread-local instance fields; at least, not anything that's likely to
be much better than the existing ThreadLocal situation.  I would love
to be proven wrong though.
--
- DML

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
On 10/17/2018 01:47 PM, David Lloyd wrote:
> It sure would be nice to have static thread local fields, similarly to
> C/C++ thread/_Thread_local, where there are just one or two pointer
> indirections to get at the data even in dynamically-loaded class
> loaders.

Mmm. I wonder if we could piggyback this on Project Panama (or
Valhalla?)

> That said, I don't think there's any good way to implement
> thread-local instance fields; at least, not anything that's likely to
> be much better than the existing ThreadLocal situation.  I would love
> to be proven wrong though.

I can think of some cheaper ways for static thread locals. You
wouldn't have features like inheritance and creation on first use, but
you'd gain performance.

Here's how: Thread.currentThread() is extremely fast on all of the
machines we care about; it's usually a single instruction. So we'd
need a growable per-thread map from static thread-local field ID to
the field contents: that's not terribly difficult. It should be
possible to do all of that in fewer than ten instructions, maybe only
five. If C++ can do it, so can we.
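A user-level sketch of that scheme, with invented names (StaticLocals, allocateSlot): each static thread-local gets a fixed integer slot, and each thread carries a plain array indexed by that slot. Here a ThreadLocal stands in for the per-thread slot array that a real VM implementation would hang directly off the Thread:

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Growable per-thread map from static thread-local field ID to
// contents. All names are invented for illustration; a VM-level
// version would reach the slot array from the current thread in
// one or two loads.
final class StaticLocals {
    private static final AtomicInteger NEXT_ID = new AtomicInteger();
    // Stand-in for a field on Thread itself.
    private static final ThreadLocal<Object[]> SLOTS =
            ThreadLocal.withInitial(() -> new Object[8]);

    /** Assign a fixed integer slot to one static thread-local field. */
    static int allocateSlot() {
        return NEXT_ID.getAndIncrement();
    }

    static Object get(int slot) {
        Object[] a = SLOTS.get();
        return slot < a.length ? a[slot] : null;
    }

    static void set(int slot, Object value) {
        Object[] a = SLOTS.get();
        if (slot >= a.length) { // grow on demand
            a = Arrays.copyOf(a, Math.max(a.length * 2, slot + 1));
            SLOTS.set(a);
        }
        a[slot] = value;
    }
}
```

Note the trade-off from the post: no inheritance and no creation-on-first-use, but the get path is an index into an array rather than an open-addressed probe over weak references.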

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
I wonder why such a focus on ThreadLocals? Is this the recommended way to capture state?


Alex


> On 17 Oct 2018, at 15:01, Andrew Haley via Concurrency-interest <[hidden email]> wrote:
> [...]


Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
On 10/17/18 10:01 AM, Andrew Haley via Concurrency-interest wrote:

> [...]
>
> It should be possible to do all of that in fewer than ten
> instructions, maybe only five. If C++ can do it, so can we.

I agree that Panama/Valhalla makes possible things like this that were
dismissed years ago when we contemplated further improvements.

This alone would not address long-standing problems with zillions of
short-lived ThreadLocals in long-lived threads.  Right now, ThreadLocal
is as fast as we know how to make it while still not completely falling
over under such usages. The only solution I know for this is to create a
new GC-aware storage class, which is not very likely to be adopted.

In the meantime, I agree with the suggestions of creating Thread
subclasses when possible, and/or calling ThreadLocal.remove. Also
consider restructuring code to use task-local classes (perhaps with
linkages among them) that are GCable when tasks complete.
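The ThreadLocal.remove suggestion might look like this in practice (RemoveExample, BUF, and render are hypothetical names): clearing the slot when the unit of work finishes keeps short-lived values from lingering in a long-lived pool thread's map.

```java
// Hypothetical illustration of the ThreadLocal.remove advice: clear
// the slot when the unit of work ends so the value doesn't linger in
// a long-lived worker thread's map.
class RemoveExample {
    static final ThreadLocal<StringBuilder> BUF =
            ThreadLocal.withInitial(StringBuilder::new);

    static String render(String s) {
        try {
            return BUF.get().append('[').append(s).append(']').toString();
        } finally {
            BUF.remove(); // don't leak into the next task on this thread
        }
    }

    public static void main(String[] args) {
        System.out.println(render("a")); // [a]
        System.out.println(render("b")); // [b] -- a fresh builder, thanks to remove()
    }
}
```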

-Doug

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
@alex I suppose there are several concurrent algorithms that are simpler to implement using it, e.g. a combiner.

On Wed, 17 Oct 2018 at 17:12, Alex Otenko via Concurrency-interest <[hidden email]> wrote:

> I wonder why such a focus on ThreadLocals? Is this the recommended way to capture state?
> [...]


Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
On Wed, Oct 17, 2018 at 11:28 AM Doug Lea via Concurrency-interest <[hidden email]> wrote:
Also consider restructuring code to use task-local classes (perhaps with
linkages among them) that are GCable when tasks complete.

I bet a lot of the ThreadLocal uses under consideration would benefit from this kind of restructuring, avoiding ThreadLocal entirely.

--tim 


Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
On Wed, Oct 17, 2018 at 11:01 AM Tim Peierls via Concurrency-interest
<[hidden email]> wrote:
> On Wed, Oct 17, 2018 at 11:28 AM Doug Lea via Concurrency-interest <[hidden email]> wrote:
>> Also consider restructuring code to use task-local classes (perhaps with
>> linkages among them) that are GCable when tasks complete.
>
> I bet a lot of the ThreadLocal uses under consideration would benefit from this kind of restructuring, avoiding ThreadLocal entirely.

How would one access task-local data if not by way of a ThreadLocal?
--
- DML

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
On 10/17/2018 04:07 PM, Doug Lea via Concurrency-interest wrote:
> This alone would not address long-standing problems with zillions of
> short-lived ThreadLocals in long-lived threads.  Right now, ThreadLocal
> is as fast as we know how to make it while still not completely falling
> over under such usages. The only solution I know for this is to create a
> new GC-aware storage class, which is not very likely to be adopted.

Well, yeah, but creating zillions of short-lived ThreadLocals seems
like an antipattern to me.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
On 10/17/2018 03:53 PM, Alex Otenko wrote:
> I wonder why such a focus on ThreadLocals? Is this the recommended way to capture state?

It seems seductively simple. However, it's going to break very badly
with new concurrency models like fibers, so subclassing seems like a
better plan.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
On 10/17/2018 05:10 PM, David Lloyd via Concurrency-interest wrote:
> How would one access task-local data if not by way of a ThreadLocal?

((MyThread) Thread.currentThread()).getFoo()

would work, surely. It's true that you'd need to know that the
current thread was an instance of MyThread.
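Spelled out a little (MyThread and foo are, of course, hypothetical), the per-thread state becomes a plain field, reached with one cast and one load instead of the map probe that ThreadLocal.get() performs:

```java
// Sketch of the Thread-subclass approach. MyThread and foo are
// hypothetical names for illustration.
class MyThread extends Thread {
    final Object foo = "per-thread state";

    MyThread(Runnable r) { super(r); }

    Object getFoo() { return foo; }
}

class SubclassDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread t = new MyThread(() -> {
            // One cast, one field load.
            Object foo = ((MyThread) Thread.currentThread()).getFoo();
            System.out.println(foo);
        });
        t.start();
        t.join();
    }
}
```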

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
I agree. For your amusement: when I first designed and implemented ThreadLocal, I assumed that no VM would ever have more than ~10 thread locals over its lifetime. We reimplemented it several times as my initial estimate proved further and further from the truth. The API has held up pretty well, though.

Josh

On Wed, Oct 17, 2018 at 12:45 PM Andrew Haley via Concurrency-interest <[hidden email]> wrote:
> [...] creating zillions of short-lived ThreadLocals seems
> like an antipattern to me.

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
On Wed, Oct 17, 2018 at 12:11 PM David Lloyd <[hidden email]> wrote:
On Wed, Oct 17, 2018 at 11:01 AM Tim Peierls wrote:
> On Wed, Oct 17, 2018 at 11:28 AM Doug Lea via Concurrency-interest <[hidden email]> wrote:
>> Also consider restructuring code to use task-local classes (perhaps with linkages among them) that are GCable when tasks complete.
>
> I bet a lot of the ThreadLocal uses under consideration would benefit from this kind of restructuring, avoiding ThreadLocal entirely.

How would one access task-local data if not by way of a ThreadLocal?

I'm thinking of settings where a task instance is confined to a single thread, so that the fields of that instance are safely accessible and modifiable without need of any other machinery.

I suspect that it is common for people to use per-thread state simply because it's easy, rather than trying to structure things in terms of tasks that run in a single thread at a time (with happens-before edges wherever the task is passed to another thread).

--tim


Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
> creating zillions of short-lived ThreadLocals seems like an
> antipattern to me.

Perhaps you can share another way to solve this problem.

I have a ByteBuffer that maps a large file.  I have multiple threads
reading the ByteBuffer at different positions.  As long as the threads
don't call ByteBuffer.position(), they can operate concurrently on the
ByteBuffer.  However, ByteBuffer.get(byte[]) has no absolute variant,
so the thread has to call position().

Attempt #1: I started by putting a lock around the ByteBuffer. This
causes a lot of contention.

Attempt #2: I started by slicing the ByteBuffer.  This created a lot of
garbage.

Attempt #3: I put the sliced ByteBuffers into ThreadLocal but with many
files mapped, consumed and unmapped rapidly, this leads to zillions of
short-lived ThreadLocals.

Attempt #4: I put the sliced ByteBuffers into a LinkedTransferQueue but
this created a lot of garbage for creating nodes in the queue.

Attempt #5: I put the sliced ByteBuffers into a ConcurrentHashMap keyed
on the Thread.  I cannot remember why this didn't work.  I think the
overhead of ConcurrentHashMap created a lot of garbage.

Attempt #6: I went back to attempt #3 (ThreadLocal) and call expunge
when the thread returns to the thread pool.  Yes, this creates zillions
of short-lived ThreadLocals, but they get cleaned out quickly so there
is no performance degradation for ThreadLocal lookup.

Each thread cannot have its sliced ByteBuffer passed through the stack
as an argument.  This would create a lot of garbage from duplicate
structures.
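For concreteness, attempt #3/#6 could be sketched roughly like this
(class and method names are illustrative, not Nathan's actual code):

```java
import java.nio.ByteBuffer;

// Each thread caches its own duplicate of a shared ByteBuffer in a
// ThreadLocal; duplicate() gives independent position/limit over the same
// underlying storage, so per-thread position() calls no longer race.
public class SlicedBufferDemo {
    private final ThreadLocal<ByteBuffer> slice;

    public SlicedBufferDemo(ByteBuffer shared) {
        this.slice = ThreadLocal.withInitial(shared::duplicate);
    }

    // Read len = dst.length bytes starting at pos, using this thread's slice.
    public void read(int pos, byte[] dst) {
        ByteBuffer b = slice.get();
        b.position(pos);
        b.get(dst);
    }

    // Call when the thread returns to the pool (attempt #6's "expunge"),
    // so the entry, and the mapped file it pins, can be reclaimed promptly.
    public void release() {
        slice.remove();
    }

    public static void main(String[] args) {
        SlicedBufferDemo d =
            new SlicedBufferDemo(ByteBuffer.wrap("abcdef".getBytes()));
        byte[] dst = new byte[2];
        d.read(2, dst);
        System.out.println(new String(dst)); // "cd"
        d.release();
    }
}
```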

-Nathan

On 10/17/2018 10:29 AM, Andrew Haley via Concurrency-interest wrote:
> On 10/17/2018 04:07 PM, Doug Lea via Concurrency-interest wrote:
>> This alone would not address long-standing problems with zillions of
>> short-lived ThreadLocals in long-lived threads.  Right now, ThreadLocal
>> is as fast as we know how to make it while still not completely falling
>> over under such usages. The only solution I know for this is to create a
>> new GC-aware storage class, which is not very likely to be adopted.
> Well, yeah, but creating zillions of short-lived ThreadLocals seems
> like an antipattern to me.
>

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
On Wed, Oct 17, 2018 at 12:01 PM Tim Peierls <[hidden email]> wrote:

> On Wed, Oct 17, 2018 at 12:11 PM David Lloyd <[hidden email]> wrote:
>> On Wed, Oct 17, 2018 at 11:01 AM Tim Peierls wrote:
>> > On Wed, Oct 17, 2018 at 11:28 AM Doug Lea via Concurrency-interest <[hidden email]> wrote:
>> >> Also consider restructuring code to use task-local classes (perhaps with linkages among them) that are GCable when tasks complete.
>> >
>> > I bet a lot of the ThreadLocal uses under consideration would benefit from this kind of restructuring, avoiding ThreadLocal entirely.
>>
>> How would one access task-local data if not by way of a ThreadLocal?
>
> I'm thinking of settings where a task instance is confined to a single thread, so that the fields of that instance are safely accessible and modifiable without need of any other machinery.

That's fair, but I'm not talking about safety; I was asking
specifically how task-local classes can be accessible at all.  One can
allocate objects on the heap and then either use them within a single
lexical scope, pass them among methods as explicit method parameters,
or pass them implicitly between methods by direct or indirect usage of
a thread local (either ThreadLocal or some other path from Thread to
the data in question, including fields on a Thread subclass).

Doug's mention of "task local classes" sounds like he was alluding to
some new mode of access other than these three, so I would be
interested to know of such a thing.

> I suspect that it is common for people to use per-thread state simply because it's easy, rather than trying to structure things in terms of tasks that run in a single thread at a time (with happens-before edges wherever the task is passed to another thread).

Maybe.

--
- DML

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
I was about to say that sharing Buffers would qualify as an example of a good use of ThreadLocals.

The important thing here is not having a concurrent way of obtaining a preallocated Buffer, but the ability to use a Buffer whose contents are very likely not shared with any other CPU. (Yes, the thread may migrate, but generally speaking...)

Alex

> On 17 Oct 2018, at 18:06, Nathan and Ila Reynolds via Concurrency-interest <[hidden email]> wrote:
>
> > creating zillions of short-lived ThreadLocals seems like an antipattern to me.
>
> Perhaps, you can share another way to solve this problem.
>
> I have a ByteBuffer that maps a large file.  I have multiple threads reading the ByteBuffer at different positions.  As long as the threads don't call ByteBuffer.position(), they can operate concurrently on the ByteBuffer.  However, ByteBuffer.get(byte[]) does not have an absolute method hence the thread has to call position().
>
> Attempt #1: I started by putting a lock around the ByteBuffer. This causes a lot of contention.
>
> Attempt #2: I started by slicing the ByteBuffer.  This created a lot of garbage.
>
> Attempt #3: I put the sliced ByteBuffers into ThreadLocal but with many files mapped, consumed and unmapped rapidly, this leads to zillions of short-lived ThreadLocals.
>
> Attempt #4: I put the sliced ByteBuffers into a LinkedTransferQueue but this created a lot of garbage for creating nodes in the queue.
>
> Attempt #5: I put the sliced ByteBuffers into a ConcurrentHashMap keyed on the Thread.  I cannot remember why this didn't work.  I think the overhead of ConcurrentHashMap created a lot of garbage.
>
> Attempt #6: I went back to attempt #3 (ThreadLocal) and call expunge when the thread returns to the thread pool.  Yes, this creates zillions of short-lived ThreadLocals but they get cleaned out quickly so there is performance degradation for ThreadLocal lookup.
>
> Each thread cannot have its sliced ByteBuffer passed through the stack as an argument.  This would create a lot of garbage from duplicate structures.
>
> -Nathan
>
> On 10/17/2018 10:29 AM, Andrew Haley via Concurrency-interest wrote:
>> On 10/17/2018 04:07 PM, Doug Lea via Concurrency-interest wrote:
>>> This alone would not address long-standing problems with zillions of
>>> short-lived ThreadLocals in long-lived threads.  Right now, ThreadLocal
>>> is as fast as we know how to make it while still not completely falling
>>> over under such usages. The only solution I know for this is to create a
>>> new GC-aware storage class, which is not very likely to be adopted.
>> Well, yeah, but creating zillions of short-lived ThreadLocals seems
>> like an antipattern to me.
>>

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
On Wed, Oct 17, 2018 at 12:36 PM Nathan and Ila Reynolds via
Concurrency-interest <[hidden email]> wrote:
> Perhaps, you can share another way to solve this problem.
>
> I have a ByteBuffer that maps a large file.  I have multiple threads
> reading the ByteBuffer at different positions.  As long as the threads
> don't call ByteBuffer.position(), they can operate concurrently on the
> ByteBuffer.  However, ByteBuffer.get(byte[]) does not have an absolute
> method hence the thread has to call position().

The obvious solution would seem to be that we should enhance
ByteBuffer to have such a method.

But, your #2 should work if you are careful to do it like this:

   buf.duplicate().position(newPos).get(byteArray);

In such cases, HotSpot can sometimes delete the allocation of the new
ByteBuffer altogether.  I seem to recall that my colleague Andrew
Haley (on this thread) did some work/research in this area a while
ago.
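A sketch of that suggestion, hedged: names are illustrative, and the claim
about HotSpot is only that escape analysis can *sometimes* eliminate the
short-lived duplicate. (For what it's worth, JDK 13 later added an absolute
bulk ByteBuffer.get(int, byte[]), which removes the need for this dance.)

```java
import java.nio.ByteBuffer;

// Do the position() call on a throwaway duplicate, leaving the shared
// buffer's position untouched, so concurrent readers never interfere.
public class DuplicateReadDemo {
    public static byte[] readAt(ByteBuffer shared, int pos, int len) {
        byte[] dst = new byte[len];
        ByteBuffer dup = shared.duplicate(); // independent position/limit
        dup.position(pos);
        dup.get(dst);
        return dst;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap("abcdef".getBytes());
        System.out.println(new String(readAt(buf, 1, 3))); // "bcd"
        System.out.println(buf.position()); // still 0: the shared buffer
                                            // was never mutated
    }
}
```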

--
- DML

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
You are fighting with ByteBuffer, which is not my favorite API. You do not need fancy thread-locals; you need a decent memory-mapped file.

On Wed, Oct 17, 2018 at 1:35 PM Nathan and Ila Reynolds via Concurrency-interest <[hidden email]> wrote:
 > creating zillions of short-lived ThreadLocals seems like an
antipattern to me.

Perhaps, you can share another way to solve this problem.

I have a ByteBuffer that maps a large file.  I have multiple threads
reading the ByteBuffer at different positions.  As long as the threads
don't call ByteBuffer.position(), they can operate concurrently on the
ByteBuffer.  However, ByteBuffer.get(byte[]) does not have an absolute
method hence the thread has to call position().

Attempt #1: I started by putting a lock around the ByteBuffer. This
causes a lot of contention.

Attempt #2: I started by slicing the ByteBuffer.  This created a lot of
garbage.

Attempt #3: I put the sliced ByteBuffers into ThreadLocal but with many
files mapped, consumed and unmapped rapidly, this leads to zillions of
short-lived ThreadLocals.

Attempt #4: I put the sliced ByteBuffers into a LinkedTransferQueue but
this created a lot of garbage for creating nodes in the queue.

Attempt #5: I put the sliced ByteBuffers into a ConcurrentHashMap keyed
on the Thread.  I cannot remember why this didn't work.  I think the
overhead of ConcurrentHashMap created a lot of garbage.

Attempt #6: I went back to attempt #3 (ThreadLocal) and call expunge
when the thread returns to the thread pool.  Yes, this creates zillions
of short-lived ThreadLocals but they get cleaned out quickly so there is
performance degradation for ThreadLocal lookup.

Each thread cannot have its sliced ByteBuffer passed through the stack
as an argument.  This would create a lot of garbage from duplicate
structures.

-Nathan

On 10/17/2018 10:29 AM, Andrew Haley via Concurrency-interest wrote:
> On 10/17/2018 04:07 PM, Doug Lea via Concurrency-interest wrote:
>> This alone would not address long-standing problems with zillions of
>> short-lived ThreadLocals in long-lived threads.  Right now, ThreadLocal
>> is as fast as we know how to make it while still not completely falling
>> over under such usages. The only solution I know for this is to create a
>> new GC-aware storage class, which is not very likely to be adopted.
> Well, yeah, but creating zillions of short-lived ThreadLocals seems
> like an antipattern to me.
>

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
On Wed, Oct 17, 2018 at 1:41 PM David Lloyd <[hidden email]> wrote:
On Wed, Oct 17, 2018 at 12:01 PM Tim Peierls <[hidden email]> wrote:
> On Wed, Oct 17, 2018 at 12:11 PM David Lloyd <[hidden email]> wrote:
>> On Wed, Oct 17, 2018 at 11:01 AM Tim Peierls wrote:
>> > On Wed, Oct 17, 2018 at 11:28 AM Doug Lea via Concurrency-interest <[hidden email]> wrote:
>> >> Also consider restructuring code to use task-local classes (perhaps with linkages among them) that are GCable when tasks complete.
>> >
>> > I bet a lot of the ThreadLocal uses under consideration would benefit from this kind of restructuring, avoiding ThreadLocal entirely.
>>
>> How would one access task-local data if not by way of a ThreadLocal?
>
> I'm thinking of settings where a task instance is confined to a single thread, so that the fields of that instance are safely accessible and modifiable without need of any other machinery.

That's fair, but I'm not talking about safety; I was asking specifically how task-local classes can be accessible at all.  One can allocate objects on the heap, and use them either within a single lexical scope, pass them among methods as explicit method parameters, or passed implicitly between methods by direct or indirect usage of a thread local (either ThreadLocal or through some other path from Thread to the data in question, including fields on a Thread subclass).

I'm talking about submitting tasks to execution services that wrap thread pools:
class MyTask implements Supplier<Result> {
    // task-local state as fields of MyTask:
    ... 

    MyTask(...) { ... }

    @Override public Result get() {
        ... do some work involving task-local state and eventually (maybe) produce a Result ...
    }
}

CompletableFuture<Result> future = CompletableFuture.supplyAsync(new MyTask(...));
That's all I meant.

 
Doug's mention of "task local classes" sounds like he was alluding to some new mode of access other than these three, so I would be interested to know of such a thing.

Not positive what Doug meant, but I don't think he was talking about a new mode of access when he wrote "task-local classes".


I suspect that it is common for people to use per-thread state simply because it's easy, rather than trying to structure things in terms of tasks that run in a single thread at a time (with happens-before edges wherever the task is passed to another thread).

Maybe.

Sounds like a good question to pose to people who can search large codebases for patterns of use! Kevin Bourrillion, are you listening?

--tim 
