Overhead of ThreadLocal data

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
@nathan
The point that Andrew raised is more about safepoints: when you access a MappedByteBuffer you are executing Java code between safepoint polls, so if the kernel blocks you (e.g. on a page fault) you won't reach the next safepoint soon, delaying all the other threads that are waiting for you to join it.
With FileChannel you are in a JNI call, which is effectively *at* a safepoint, so it won't delay any Java mutator threads from reaching one (if needed). Makes sense?

On Thu, 18 Oct 2018 at 18:03, Nathan and Ila Reynolds via Concurrency-interest <[hidden email]> wrote:
When accessing a file through the kernel's file I/O, the thread context
switches from user land into kernel land.  The thread then goes to the
file cache to see if the data is there.  If not, the kernel blocks the
thread and pulls the data from disk.  Once the data is in the file
cache, the thread copies the data into the user land buffer and context
switches back to user land.

When accessing a file through memory mapped I/O, the thread does a load
instruction against RAM.  If the data is not in RAM, then the thread
switches to the kernel, blocks while pulling data from disk and resumes
operation.

File I/O and memory mapped I/O do the same operations but in a different
order.  The difference is key.  With file I/O, the thread has to context
switch into the kernel with every access. Thus, we use large buffers to
minimize the performance impact of the kernel round trip.  It is the
context switch with every operation that hurts file I/O and where memory
mapped I/O shines. So, memory mapped I/O does well at concurrent random
reads of large files, but comes with an initialization cost and isn't
the best solution for all file access.  I have found that file I/O
considerably outperforms memory mapped I/O when sequentially reading and
writing to a file unless you can map the entire file in one large piece.

-Nathan

On 10/18/2018 3:58 AM, Andrew Haley wrote:
> On 10/18/2018 10:29 AM, Andrey Pavlenko wrote:
>> In case of a non-large file and/or sequential reads MappedByteBuffer is
>> definitely faster, but the question here is about *concurrent* *random*
>> reads of *large* files. For this purpose MappedByteBuffer may be
>> inefficient.
> Perhaps. It might be that manually managing memory (by reading and
> writing the parts of the file you need) works better than letting the
> kernel do it, but the kernel will cache as much as it can anyway, so
> it's not as if it'll necessarily save memory or reduce disk activity.
> There are advantages to using read() for sequential file access because
> the kernel can automatically read and cache the next part of the file.
>
> There were some problems with inefficient code generation for byte
> buffers but we have worked on that and it's better now, with (even)
> more improvements to come. Unless there are kernel issues I don't know
> about, mapped files are excellent for random access, and the Java
> ByteBuffer operations generate excellent code. (This isn't guaranteed
> because C2 uses a bunch of heuristics, but it's usually good.)
>
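To make the two code paths above concrete, here is a minimal sketch in Java. It is not taken from the thread; the file path is a placeholder and it assumes an existing file at least a few kilobytes long. FileChannel.read() goes through a native call into the kernel, while MappedByteBuffer.get() is an ordinary load executed as Java code, which is where a page fault can stall the thread between safepoint polls.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ThreadLocalRandom;

public class RandomReadPaths {
    public static void main(String[] args) throws IOException {
        // Placeholder input file; substitute a real, reasonably large file.
        try (FileChannel ch = FileChannel.open(Paths.get("/tmp/large.bin"),
                                               StandardOpenOption.READ)) {
            long size = ch.size();
            long pos = ThreadLocalRandom.current().nextLong(size - 4096);

            // Path 1: positional read through the kernel's file I/O.
            // This is a native call; the thread is safepoint-safe while blocked in it.
            ByteBuffer dst = ByteBuffer.allocateDirect(4096);
            ch.read(dst, pos);

            // Path 2: memory-mapped read. map() is limited to Integer.MAX_VALUE
            // bytes per region, so map a window around the position of interest.
            long winStart = Math.max(0, pos - 1024);
            long winLen = Math.min(size - winStart, 1 << 20);
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, winStart, winLen);
            // A plain load; if the page is not resident, the thread takes a page
            // fault while "in Java", between safepoint polls.
            byte b = map.get((int) (pos - winStart));

            System.out.println("kernel read " + dst.position() + " bytes, mapped byte = " + b);
        }
    }
}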

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list

No, it doesn't. Accessing a memory-mapped area is off heap, so what makes it less safe than JNI?

On 18 Oct 2018 17:58, "Francesco Nigro via Concurrency-interest" <[hidden email]> wrote:
> With FileChannel you are in a JNI call, which is effectively *at* a safepoint, so it won't delay any Java mutator threads from reaching one (if needed). Makes sense?

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list

Yes, thank you for clarifying.  Besides taking a thread's call stack (?) and GC, what other operations wait for a Java thread to reach a safepoint?

-Nathan

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
It isn't less "safe", but if MappedByteBuffer::get gets inlined and the Unsafe call is an intrinsic, I expect it to contain no safepoint poll at all (if it is not inlined it "should" have a poll_return on method exit, AFAIK). I have to verify this with newer versions of the JVM, but AFAIK that's quite a big difference compared with a JNI call.
Am I missing anything? I apologize if I've written something incorrect or not precise enough.

Sorry, I answered this one privately :)

On Thu, 18 Oct 2018 at 19:06, Alex Otenko <[hidden email]> wrote:

> No, it doesn't. Accessing a memory-mapped area is off heap, so what makes it less safe than JNI?
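One rough way to check the expectation above (that an inlined MappedByteBuffer::get compiles down to bounds checks plus a raw load) is to put the access in a hot loop and inspect the code C2 emits for it. The sketch below is only an illustration, not something posted on the list; the path is a placeholder, and the flags in the comments (-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, which require the hsdis disassembler) are the usual way to dump compiled code, though the exact output depends on the JDK build.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedLoop {
    // Run with: -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly (needs hsdis)
    // and look at where safepoint polls appear in the compiled code of sum().
    static long sum(MappedByteBuffer buf) {
        long s = 0;
        // If get() is inlined down to the Unsafe intrinsic, the loop body is just
        // bounds checks plus a raw load; whether and where C2 places safepoint
        // polls (loop back edge, method exit) depends on the JVM version.
        for (int i = 0; i < buf.limit(); i++) {
            s += buf.get(i);
        }
        return s;
    }

    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("/tmp/large.bin"), // placeholder path
                                               StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0,
                                          Math.min(ch.size(), 64 << 20));
            long total = 0;
            for (int iter = 0; iter < 100; iter++) { // warm up so C2 compiles sum()
                total += sum(buf);
            }
            System.out.println(total);
        }
    }
}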

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
@nathan I do not have a complete list, and I was asking Andrew this same question when 


Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list

When a thread reads a memory mapped area, it does a load instruction.  If the page is not in RAM or the process's page table, then the thread will exit user land and enter the kernel.  The thread could then load the page from disk.  This could take a while especially if the disk is heavily loaded.  Meanwhile, the JVM could be waiting for the thread to reach a safepoint.  It won't be able to until the thread returns from loading the page from disk.

With regular file I/O, the thread can be blocked inside the kernel.  When it returns from the kernel, it checks a flag and sees that it should block.  Thus, the thread does not execute Java code during a stop-the-world operation.  Hence, the JVM can assume the thread has reached a safepoint.

-Nathan
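As a way to observe the effect Nathan describes, here is a sketch (not from the thread): on JDK 9 and later, running with -Xlog:safepoint prints per-safepoint timings, including how long stopping the threads took, so random-reading a mapped file much larger than RAM from one thread while another thread triggers safepoints should make the delay visible. The path and sizes are placeholders.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ThreadLocalRandom;

public class PageFaultSafepointDemo {
    // Run with: java -Xlog:safepoint PageFaultSafepointDemo
    // The reported thread-stopping / synchronization times should grow when the
    // reader thread keeps taking page faults inside mapped-buffer loads.
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get("/tmp/huge.bin"), // placeholder path
                                               StandardOpenOption.READ)) {
            // Ideally the file is much larger than RAM so reads keep faulting.
            long len = Math.min(ch.size(), 1L << 30);
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, len);

            Thread reader = new Thread(() -> {
                long sink = 0;
                while (!Thread.currentThread().isInterrupted()) {
                    int index = ThreadLocalRandom.current().nextInt((int) len);
                    sink += buf.get(index);   // may page-fault while "in Java"
                }
                System.out.println(sink);
            });
            reader.start();

            for (int i = 0; i < 30; i++) {
                Thread.sleep(1000);
                System.gc();                  // force a safepoint to measure against
            }
            reader.interrupt();
            reader.join();
        }
    }
}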

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list

This is strange. While the page is being loaded, the thread is not on the CPU. Does the GC wait for all the threads to get back on the CPU? Or is there some other reason that makes it impossible to find out whether the thread is on the CPU?

Alex


Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
It depends on whether a global safepoint operation has been triggered (or whether the JVM is running its periodic safepoint, every 1 second by default IIRC): in that case the answer is yes, all the threads need to reach their safepoint poll to be *in* the safepoint.


Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
On 10/18/2018 06:42 PM, Alex Otenko via Concurrency-interest wrote:
> This is strange. While the page is being loaded, the thread is not on the
> CPU. Does the GC wait for all the threads to get back on the CPU?

Yes, absolutely. It has to do that.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
On 18 Oct 2018 18:38, "Nathan and Ila Reynolds" <[hidden email]> wrote:

> When a thread reads a memory mapped area, it does a load instruction.  If
> the page is not in RAM or the process's page table, then the thread will
> exit user land and enter the kernel.  The thread could then load the page
> from disk.  This could take a while especially if the disk is heavily
> loaded.  Meanwhile, the JVM could be waiting for the thread to reach a
> safepoint.  It won't be able to until the thread returns from loading the
> page from disk.
>
> With regular file I/O, the thread can be blocked inside the kernel.  When
> it returns from the kernel, it checks a flag and sees that it should
> block.  Thus, the thread does not execute Java code during a stop-the-world
> operation.  Hence, the JVM can assume the thread has reached a safepoint.

Not exactly. A native call *is* a safepoint: the GC can run even while a
native method is blocked in the kernel. The GC cannot run while we are
running Java code, even if that Java code is blocked by the kernel,
because we do not know when the kernel will restart that Java code. It
could be at any time. So we have to wait for the Java code to reach a
safepoint, and that means we have to wait for it to be unblocked by the
kernel.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list

The VM doesn't try to determine whether a thread is on CPU, and if it did it would do no good, as it can't know when the thread comes back on CPU. A thread executes in a number of possible safepoint-related states, mainly: in_java, in_VM, in_native and blocked. The first two require that the VMThread wait until the thread notices the safepoint request and changes to a safe state; that is where delay can occur. The latter two states are safepoint-safe, and a thread can't transition out of those states while a safepoint is in progress. The "blocked" state is for synchronization operations and sleeps, not I/O, which is typically performed in_native. There are times when a thread goes "native" without a state transition (e.g. fast JNI field accesses), so those operations have to be short to avoid safepoint delays.

 

Hotspot is moving away from global safepoints where possible, in favor of thread-directed handshakes. New GCs will take advantage of this, but old ones will still use a global safepoint for at least part of the GC operation.

 

David

 


Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list

Correction: new GCs will still have some global safepoints.

 

David

 


Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
But at least it is now possible to trigger "local safepoints" too, i.e. http://openjdk.java.net/jeps/312: other JVMs (e.g. Zing) have had this feature for a long time, but I'm happy we're getting there :)


Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
On 10/19/2018 03:27 AM, David Holmes via Concurrency-interest wrote:
> Correction: new GCs will still have some global safepoints.

And it's really important to remember that it's not just about the GCs: there is much
shared metadata that has to be scanned at safepoints.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
It is interesting that people have problems with short-lived ThreadLocals, because ThreadLocal tries hard to expunge stale entries gradually during use. Even get() tries to do it if the access happens to miss the home slot. Repeated access to the same entry should not have to skip stale entries over and over again. There seem to be special use patterns that provoke problems. Studying them more deeply could show us the weakness of the current ThreadLocal auto-expunge strategy, so it could be improved and a new method would not be needed...

Maybe the problem is not caused by unexpunged stale entries. It may be related to degeneration of the hash table, which leads to sub-optimal placement of live entries. In that case, perhaps triggering a re-hash automatically when get() encounters many live entries it has to skip would help.

So those with problems, please be more specific about them. Can you show us a reproducer?

Regards, Peter
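
One hypothetical shape such a reproducer could take (pool size, task count and buffer size invented for illustration): each task creates and abandons its own ThreadLocal in a long-lived worker thread and never calls remove().

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ShortLivedThreadLocals {
        public static void main(String[] args) {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int i = 0; i < 1_000_000; i++) {
                pool.execute(() -> {
                    // A fresh ThreadLocal per task: its map entry lives in the
                    // long-lived worker thread until the weak key is cleared by
                    // GC and the map happens to expunge it.
                    ThreadLocal<byte[]> scratch =
                            ThreadLocal.withInitial(() -> new byte[256]);
                    scratch.get()[0] = 1;
                    // no scratch.remove(): the entry is simply abandoned
                });
            }
            pool.shutdown();
        }
    }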

On 10/17/2018 08:26 PM, Doug Lea via Concurrency-interest wrote:
[+list]

On 10/17/18 11:44 AM, Nathan and Ila Reynolds wrote:
Can we add the following method to ThreadLocal?

public static void expungeStaleEntries()
This seems like a reasonable request (although perhaps with an improved
name).  The functionality exists internally, and it seems overly
parental not to export it for use as a band-aid by those people who have
tried and otherwise failed to solve the zillions of short-lived
ThreadLocals in long-lived threads problem.

Can anyone think of a reason not to do this?

-Doug

This method will call ThreadLocal.ThreadLocalMap.expungeStaleEntries()
for the ThreadLocalMap of the current thread.  Thread pools can then
call this method when the thread finishes processing a job after GC.
This solves the problem of zillions of short-lived ThreadLocals in
long-lived threads.  
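
For illustration only, a pool could hook such a method into afterExecute. Note that expungeStaleEntries() is the method being proposed in this thread, not an existing JDK API, so the call below is left commented out.

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class ExpungingThreadPool extends ThreadPoolExecutor {
        public ExpungingThreadPool(int nThreads) {
            super(nThreads, nThreads, 0L, TimeUnit.MILLISECONDS,
                  new LinkedBlockingQueue<>());
        }

        @Override
        protected void afterExecute(Runnable r, Throwable t) {
            super.afterExecute(r, t);
            // Runs in the worker thread that just finished the job, so it would
            // clean that thread's own ThreadLocalMap.
            // ThreadLocal.expungeStaleEntries();   // proposed API, not in the JDK
        }
    }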
_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest


_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
The problem is deeper than that; people use TL because it's what we've
got, when really what people want (in various situations) is {
processor, frame, task } locals, and TL is the best approximation we
have.  Over in Project Loom, there's a deeper exploration going on of
the use cases, and what additional mechanisms might help.

On 10/19/2018 6:12 AM, Peter Levart via Concurrency-interest wrote:

> It is interesting that people have problems with short-lived
> ThreadLocals because ThreadLocal tries hard to expunge stale entries
> gradually during use. Even get()  tries to do it if the access happens
> to miss the home slot. Repeated access to the same entry should not have
> to skip stale entries over and over again. There seems to be special use
> pattern(s) that provoke problems. Studying them more deeply could show
> us the weakness of current ThreadLocal auto-expunge strategy so it could
> be improved and new method would not be needed...
>
> Maybe the problem is not caused by unexpunged stale entries. It may be
> related to degeneration of hashtable that leads to sub-optimal placement
> of live entries. In that case, perhaps triggering a re-hash
> automatically when get() encounters many live entries it has to skip
> would help.
>
> So those with problems, please be more specific about them. Can you show
> us a reproducer?
>
> Regards, Peter
>
> On 10/17/2018 08:26 PM, Doug Lea via Concurrency-interest wrote:
>> [+list]
>>
>> On 10/17/18 11:44 AM, Nathan and Ila Reynolds wrote:
>>> Can we add the following method to ThreadLocal?
>>>
>>> public static void expungeStaleEntries()
>> This seems like a reasonable request (although perhaps with an improved
>> name).  The functionality exists internally, and it seems overly
>> parental not to export it for use as a band-aid by those people who have
>> tried and otherwise failed to solve the zillions of short-lived
>> ThreadLocals in long-lived threads problem.
>>
>> Can anyone think of a reason not to do this?
>>
>> -Doug
>>
>>> This method will call ThreadLocal.ThreadLocalMap.expungeStaleEntries()
>>> for the ThreadLocalMap of the current thread.  Thread pools can then
>>> call this method when the thread finishes processing a job after GC.
>>> This solves the problem of zillions of short-lived ThreadLocals in
>>> long-lived threads.
>> _______________________________________________
>> Concurrency-interest mailing list
>> [hidden email]
>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>
>
> _______________________________________________
> Concurrency-interest mailing list
> [hidden email]
> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>
_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
A while ago, I raised the request for a processor local.  Striping a
highly contended cache line reduces contention, but the cache line
still bounces around cores.  Using a processor local, each cache line
can be assigned to a core and it remains in the exclusive state for
that core.  Atomic operations on that cache line are the fastest
possible.

My original example was a reader-writer lock which is almost
exclusively used for read operations.  However, with a processor
local, we can implement uncontended counters for heavily used queues.
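
As a rough sketch of the striping half of that comparison (the class name and cell count are invented; real striped counters such as java.util.concurrent.atomic.LongAdder also pad their cells to avoid false sharing):

    import java.util.concurrent.atomic.AtomicLong;

    final class StripedCounter {
        // Power-of-two number of cells, roughly matching the core count.
        private static final int NCELLS =
                Integer.highestOneBit(Runtime.getRuntime().availableProcessors() * 2);
        private final AtomicLong[] cells = new AtomicLong[NCELLS];

        StripedCounter() {
            for (int i = 0; i < NCELLS; i++) cells[i] = new AtomicLong();
        }

        void increment() {
            // Java exposes no public "which CPU am I on" call, so the cell is
            // picked by thread id; threads (and their cells) can still migrate
            // between cores, which is the limitation a processor local removes.
            int idx = (int) Thread.currentThread().getId() & (NCELLS - 1);
            cells[idx].incrementAndGet();
        }

        long sum() {
            long s = 0;
            for (AtomicLong c : cells) s += c.get();
            return s;
        }
    }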

-Nathan

On 10/19/2018 7:06 AM, Brian Goetz wrote:

> The problem is deeper than that; people using TL because it's what
> we've got, when really what people want (in various situations) is {
> processor, frame, task } locals, and TL is the best approximation we
> have.  Over in Project Loom, there's a deeper exploration going on of
> the use cases, and what additional mechanisms might help.
>
> On 10/19/2018 6:12 AM, Peter Levart via Concurrency-interest wrote:
>> It is interesting that people have problems with short-lived
>> ThreadLocals because ThreadLocal tries hard to expunge stale entries
>> gradually during use. Even get() tries to do it if the access happens
>> to miss the home slot. Repeated access to the same entry should not
>> have to skip stale entries over and over again. There seems to be
>> special use pattern(s) that provoke problems. Studying them more
>> deeply could show us the weakness of current ThreadLocal auto-expunge
>> strategy so it could be improved and new method would not be needed...
>>
>> Maybe the problem is not caused by unexpunged stale entries. It may
>> be related to degeneration of hashtable that leads to sub-optimal
>> placement of live entries. In that case, perhaps triggering a re-hash
>> automatically when get() encounters many live entries it has to skip
>> would help.
>>
>> So those with problems, please be more specific about them. Can you
>> show us a reproducer?
>>
>> Regards, Peter
>>
>> On 10/17/2018 08:26 PM, Doug Lea via Concurrency-interest wrote:
>>> [+list]
>>>
>>> On 10/17/18 11:44 AM, Nathan and Ila Reynolds wrote:
>>>> Can we add the following method to ThreadLocal?
>>>>
>>>> public static void expungeStaleEntries()
>>> This seems like a reasonable request (although perhaps with an improved
>>> name).  The functionality exists internally, and it seems overly
>>> parental not to export it for use as a band-aid by those people who
>>> have
>>> tried and otherwise failed to solve the zillions of short-lived
>>> ThreadLocals in long-lived threads problem.
>>>
>>> Can anyone think of a reason not to do this?
>>>
>>> -Doug
>>>
>>>> This method will call ThreadLocal.ThreadLocalMap.expungeStaleEntries()
>>>> for the ThreadLocalMap of the current thread.  Thread pools can then
>>>> call this method when the thread finishes processing a job after GC.
>>>> This solves the problem of zillions of short-lived ThreadLocals in
>>>> long-lived threads.
>>> _______________________________________________
>>> Concurrency-interest mailing list
>>> [hidden email]
>>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>
>>
>> _______________________________________________
>> Concurrency-interest mailing list
>> [hidden email]
>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>
_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
The problem with “short lived” thread locals that do not explicitly use remove() when they are no longer needed is that in those situations “short lived” means “until the next GC detects that the ThreadLocal is no longer reachable, and the weak reference to that ThreadLocal starts returning nulls”. Scanning for stale entries more frequently, or on an API call, won’t help these situations because the map entries are not “stale” until the collector determines the unreachability of their related ThreadLocal instances. When short-lived ThreadLocals are created and die at a rate linear in application throughput, this leads to ThreadLocal maps in active threads holding thousands or millions of entries, to extremely long collision chains, and to situations where half the CPU is spent walking those chains in ThreadLocal.get().

Where would such a ThreadLocal instance creation rate come from? We’ve seen it happen in the wild in quite a few places. I initially thought of it as an application- or library-level misuse of a ThreadLocal, but with each instance of juc ReentrantReadWriteLock potentially using a ThreadLocal instance for internal coordination, application and library writers that use ReentrantReadWriteLock in a seemingly idiomatic way are often unaware of the ThreadLocal implications. A simple, seemingly valid pattern for using ReentrantReadWriteLock would be a system where arriving work “packets” are handled by parallel threads (doing parallel work within the work packet), where each work packet uses its own ReentrantReadWriteLock instance for coordination of the parallelism within the packet.
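
A hypothetical rendering of that pattern (names and counts invented for illustration): each short-lived packet brings its own ReentrantReadWriteLock, and read-locking it from several pool threads can touch the lock's internal per-thread hold counter, which is backed by a ThreadLocal.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class PerPacketLocks {
        static final class WorkPacket {
            final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

            void readPart() {
                // Concurrent readers of the same packet can cause the lock to
                // consult its internal ThreadLocal hold counter.
                lock.readLock().lock();
                try {
                    // read shared packet state
                } finally {
                    lock.readLock().unlock();
                }
            }
        }

        public static void main(String[] args) {
            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (int i = 0; i < 100_000; i++) {
                WorkPacket packet = new WorkPacket();      // short-lived packet
                for (int j = 0; j < 8; j++) {
                    pool.execute(packet::readPart);        // parallel work within the packet
                }
            }
            pool.shutdown();
        }
    }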

With G1 (and C4, and likely Shenandoah and ZGC), with the current ThreadLocal implementation, the “until the next GC detects that the ThreadLocal is no longer reachable” meaning gets compounded by the fact that collisions in the map will actually prevent otherwise-unreachable ThreadLocal instances from becoming unreachable. As long as get() calls for other (chain-colliding) ThreadLocals are actively performed on any thread, where “actively” means “more than once during a mark cycle”, the weakrefs from the colliding entries get strengthened during the mark, preventing the colliding ThreadLocal instances from dying in the given GC cycle. Stop-the-world newgen provides a bit of a filter for short-lived-enough ThreadLocals for G1, but for ThreadLocals with mid-term lifecycles (which will naturally occur as queues in a work system as described above grow under load), or without such a STW newgen (which none of the newer collectors have or want to have), this situation leads to explosive ThreadLocal map growth under high-enough throughput.

We’ve created a tweaked implementation of ThreadLocal that avoids the weakref get()-strengthening problem with no semantic change, by having ThreadLocal get() determine identity by comparing a ThreadLocal ID (a unique long value per ThreadLocal instance, assigned at ThreadLocal instantiation) rather than comparing the outcome of a weakref get() for each entry in the chain walked when looking for a match. This makes weakref get()s only occur on the actual entry you are looking up, and not on any colliding entries, thus preventing the gets from keeping dead ThreadLocals alive. We’ve been using this implementation successfully with C4 in Zing, and would be happy to share and upstream if there is interest.
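
Roughly, the shape of such an entry and probe loop looks like the following (a heavily simplified sketch for illustration, not the actual implementation; how the id and hash reach the lookup is elided):

    import java.lang.ref.WeakReference;

    final class IdKeyedEntries {
        // Each entry remembers the unique id of the ThreadLocal it belongs to,
        // in addition to the usual weak reference to that ThreadLocal.
        static final class Entry extends WeakReference<ThreadLocal<?>> {
            final long id;      // assigned once, at ThreadLocal instantiation
            Object value;

            Entry(ThreadLocal<?> key, long id, Object value) {
                super(key);
                this.id = id;
                this.value = value;
            }
        }

        private final Entry[] table = new Entry[16];   // toy fixed-size table

        Object get(long wantedId, int hash) {
            int mask = table.length - 1;
            int i = hash & mask;
            // Open-addressing probe: colliding entries are rejected by id alone,
            // with no WeakReference.get(), so a dead colliding ThreadLocal is
            // never strengthened by this lookup.
            for (Entry e = table[i]; e != null; e = table[i = (i + 1) & mask]) {
                if (e.id == wantedId) {
                    return e.value;       // only the matching entry is used
                }
            }
            return null;
        }
    }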

Sent from my iPad

> On Oct 19, 2018, at 6:45 AM, Nathan and Ila Reynolds via Concurrency-interest <[hidden email]> wrote:
>
> A while ago, I raised the request for a processor local. Striping a highly contended cache line, reduces contention but the cache line still bounces around cores.  Using processor local, each cache line can be assigned to a core and it remains in the exclusive state for that core.  Atomic operations on that cache line are the fastest possible.
>
> My original example was a reader-writer lock which is almost exclusively used for read operations.  However, with processor local, we can implement uncontended counters for heavily used queues.
>
> -Nathan
>
>> On 10/19/2018 7:06 AM, Brian Goetz wrote:
>> The problem is deeper than that; people using TL because it's what we've got, when really what people want (in various situations) is { processor, frame, task } locals, and TL is the best approximation we have.  Over in Project Loom, there's a deeper exploration going on of the use cases, and what additional mechanisms might help.
>>
>>> On 10/19/2018 6:12 AM, Peter Levart via Concurrency-interest wrote:
>>> It is interesting that people have problems with short-lived ThreadLocals because ThreadLocal tries hard to expunge stale entries gradually during use. Even get() tries to do it if the access happens to miss the home slot. Repeated access to the same entry should not have to skip stale entries over and over again. There seems to be special use pattern(s) that provoke problems. Studying them more deeply could show us the weakness of current ThreadLocal auto-expunge strategy so it could be improved and new method would not be needed...
>>>
>>> Maybe the problem is not caused by unexpunged stale entries. It may be related to degeneration of hashtable that leads to sub-optimal placement of live entries. In that case, perhaps triggering a re-hash automatically when get() encounters many live entries it has to skip would help.
>>>
>>> So those with problems, please be more specific about them. Can you show us a reproducer?
>>>
>>> Regards, Peter
>>>
>>>> On 10/17/2018 08:26 PM, Doug Lea via Concurrency-interest wrote:
>>>> [+list]
>>>>
>>>>> On 10/17/18 11:44 AM, Nathan and Ila Reynolds wrote:
>>>>> Can we add the following method to ThreadLocal?
>>>>>
>>>>> public static void expungeStaleEntries()
>>>> This seems like a reasonable request (although perhaps with an improved
>>>> name).  The functionality exists internally, and it seems overly
>>>> parental not to export it for use as a band-aid by those people who have
>>>> tried and otherwise failed to solve the zillions of short-lived
>>>> ThreadLocals in long-lived threads problem.
>>>>
>>>> Can anyone think of a reason not to do this?
>>>>
>>>> -Doug
>>>>
>>>>> This method will call ThreadLocal.ThreadLocalMap.expungeStaleEntries()
>>>>> for the ThreadLocalMap of the current thread.  Thread pools can then
>>>>> call this method when the thread finishes processing a job after GC.
>>>>> This solves the problem of zillions of short-lived ThreadLocals in
>>>>> long-lived threads.
>>>> _______________________________________________
>>>> Concurrency-interest mailing list
>>>> [hidden email]
>>>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>>
>>>
>>> _______________________________________________
>>> Concurrency-interest mailing list
>>> [hidden email]
>>> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
>>>
> _______________________________________________
> Concurrency-interest mailing list
> [hidden email]
> http://cs.oswego.edu/mailman/listinfo/concurrency-interest
_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list
Hi Gil,

On 10/19/2018 05:09 PM, Gil Tene wrote:
We’ve created a tweaked implementation of ThreadLocal that avoids the weakref get()-strengthening problem with no semantic change, by having ThreadLocal get() determine identity by comparing a ThreadLocal ID (a unique long value per ThreadLocal instance, assigned at ThreadLocal instantiation) rather than comparing the outcome of a weakref get() for each entry in the chain walked when looking for a match. This makes weakref get()s only occur on the actual entry you are looking up, and not on any colliding entries, this preventing the gets from keeping dead ThreadLocals alive. We’ve been using this implementation successfully with C4 in Zing, and would be happy to share and upstream if there is interest.

Very interesting. One question though. How does your implementation do expunging of stale entries? Just when there is a re-hash needed because of capacity growing over the threshold? I ask because in order to detect that an entry is stale, you have to call get(). The only other variant is to have the weakrefs enqueued and then poll the queue for them. But that already interacts with a reference handling thread, and I had the impression that ThreadLocal is designed to have no synchronization whatsoever.

Regards, Peter


_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: Overhead of ThreadLocal data

JSR166 Concurrency mailing list


Sent from Gil's iPhone

> On Oct 19, 2018, at 9:06 AM, Peter Levart <[hidden email]> wrote:
>
> Hi Gil,
>
>> On 10/19/2018 05:09 PM, Gil Tene wrote:
>> We’ve created a tweaked implementation of ThreadLocal that avoids the weakref get()-strengthening problem with no semantic change, by having ThreadLocal get() determine identity by comparing a ThreadLocal ID (a unique long value per ThreadLocal instance, assigned at ThreadLocal instantiation) rather than comparing the outcome of a weakref get() for each entry in the chain walked when looking for a match. This makes weakref get()s only occur on the actual entry you are looking up, and not on any colliding entries, this preventing the gets from keeping dead ThreadLocals alive. We’ve been using this implementation successfully with C4 in Zing, and would be happy to share and upstream if there is interest.
>
> Very interesting. One question though. How does your implementation do expunging of stale entries? Just when there is a re-hash needed because of capacity growing over threshold? I ask because in order to detect that and entry is stale, you have to call get(). The only other variant is to have the weakrefs enqueued and then poll the queue for them. But that already interacts with a reference handling thread and I had the impression that ThreadLocal is designed to have no synchronization whatsoever.

We use a reference queue, and poll on it to detect and deal with entries that (weakly) referred to dead ThreadLocals.
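
Sketched very roughly (simplified for illustration, not the actual implementation), that approach looks like:

    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;
    import java.util.HashMap;
    import java.util.Map;

    final class QueueExpungingMap {
        static final class Entry extends WeakReference<ThreadLocal<?>> {
            final long id;
            Object value;
            Entry(ThreadLocal<?> tl, long id, Object value,
                  ReferenceQueue<ThreadLocal<?>> q) {
                super(tl, q);    // GC enqueues this reference once tl is dead
                this.id = id;
                this.value = value;
            }
        }

        private final ReferenceQueue<ThreadLocal<?>> queue = new ReferenceQueue<>();
        private final Map<Long, Entry> entries = new HashMap<>();   // toy backing store: id -> entry

        void put(ThreadLocal<?> tl, long id, Object value) {
            drainDead();
            entries.put(id, new Entry(tl, id, value, queue));   // map keeps the weak ref reachable
        }

        Object get(long id) {
            drainDead();
            Entry e = entries.get(id);
            return (e != null) ? e.value : null;
        }

        private void drainDead() {
            // poll() never blocks; the interaction with reference handling that
            // Peter mentions is the price of this approach.
            for (Reference<? extends ThreadLocal<?>> r; (r = queue.poll()) != null; ) {
                entries.remove(((Entry) r).id);
            }
        }
    }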

>
> Regards, Peter
>
_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest