ForkJoinPool with custom number of WorkQueues

ForkJoinPool with custom number of WorkQueues

JSR166 Concurrency mailing list
Hi,

I am using ForkJoinPool as a means to avoid lock contention when submitting tasks to an executor.  FJP is much faster than my previous executor, but it shows some unexpected slowdown when there is not enough work.  In particular, a significant amount of time is spent parking and unparking threads when there is no work to do.  When there is no active work, it seems each worker scans the other work queues looking for work before going to sleep.

In my program I have a parallelism of 64, because when the workload is high, all of the threads can be active.  However, when the workload is low, the workers spend too much time looking for more work.

One way to fix this (I think) is to lower the number of work queues while keeping the same number of workers.  In my case, having 32 or 16 queues rather than exactly 64 might help, but I have no way of testing it.  Has this ever been brought up, or is it possible for me to easily patch FJP to see if it would help?
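
For concreteness, the shape of my usage is roughly this (simplified sketch; the task body is just a placeholder):

    import java.util.concurrent.ForkJoinPool;

    class Submitter {
        // Pool sized for peak load (parallelism 64). Under light load,
        // most of the 64 workers end up scanning other queues and then
        // parking, rather than running tasks.
        static final ForkJoinPool POOL = new ForkJoinPool(64);

        static void submit(Runnable task) {
            POOL.execute(task);   // external (non-worker) submission
        }
    }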




Re: ForkJoinPool with custom number of WorkQueues

JSR166 Concurrency mailing list
In past implementations of the FJ pool there was a SPIN static variable used for this purpose; it worked quite well (with parallelism < available cores) at not immediately parking the executor threads when there is nothing to do.

I have the same problem as Carl, and I have found a similar "solution": reduce the parallelism, hoping to keep the queues busier, but that incurs higher contention on the offer side.

Wouldn't it make sense to inject some WaitStrategy to let the user choose what to do before parking (if parking at all)?
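
Something along these lines (purely hypothetical; the name and signature are invented to illustrate the idea, not an existing API):

    // Hypothetical: consulted by an idle worker before it parks.
    // Returning true means "keep waiting without parking" (spin,
    // yield, sleep, ...); false means fall through to park.
    interface WaitStrategy {
        boolean onIdle(int consecutiveEmptyScans);
    }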




Re: ForkJoinPool with custom number of WorkQueues

JSR166 Concurrency mailing list
I've also wanted a WaitStrategy, so if that's on the table I'd vote for it.




Re: ForkJoinPool with custom number of WorkQueues

JSR166 Concurrency mailing list

I agree that something should be done to improve performance in cases
where the optimal steady state is to engage only a subset of available
worker threads. I'll explore other options (this has been on my
todo-list for too long). None of those we've tried so far are very good:


On 11/9/18 2:52 AM, Francesco Nigro via Concurrency-interest wrote:
> In past implementations of the FJ pool there was a SPIN static
> variable used for this purpose; it worked quite well (with
> parallelism < available cores) at not immediately parking the
> executor threads when there is nothing to do.

... with the disadvantage of unacceptably high CPU wastage on machines
running near CPU saturation, as we discovered only after release, and
so it was later removed. There are now only a few j.u.c classes that
include paths allowing a non-trivial amount of spinning, and some of
these might be subject to reconsideration in light of Loom, where
spinning is rarely a good idea.

>
> I have the same problem as Carl, and I have found a similar
> "solution": reduce the parallelism, hoping to keep the queues
> busier, but that incurs higher contention on the offer side.

When the steady state has little variance in task execution rates,
reducing parallelism *is* the right solution. But it has the
disadvantage of poorer throughput when a lot of tasks suddenly appear.

>
> Wouldn't it make sense to inject some WaitStrategy to let the user
> choose what to do before parking (if parking at all)?

I'm not a big fan of forcing users to make such decisions, since they
can/will encounter the same issues as algorithmic solutions do, as in
Carl's situation below.

>
> On Fri, Nov 9, 2018 at 00:18 Carl Mastrangelo via
> Concurrency-interest <[hidden email]> wrote:
>
>     Hi,
>
>     I am using ForkJoinPool as a means to avoid lock contention when
>     submitting tasks to an executor.  FJP is much faster than my
>     previous executor, but it shows some unexpected slowdown when
>     there is not enough work.  In particular, a significant amount
>     of time is spent parking and unparking threads when there is no
>     work to do.  When there is no active work, it seems each worker
>     scans the other work queues looking for work before going to
>     sleep.
>
>     In my program I have a parallelism of 64, because when the
>     workload is high, all of the threads can be active.  However,
>     when the workload is low, the workers spend too much time
>     looking for more work.

Besides keeping too many threads alive, the main performance issue here
is reduced locality, because related tasks are taken by different
threads. So ...
>
>     One way to fix this (I think) is to lower the number of work
>     queues while keeping the same number of workers.  In my case,
>     having 32 or 16 queues rather than exactly 64 might help, but I
>     have no way of testing it.  Has this ever been brought up, or is
>     it possible for me to easily patch FJP to see if it would help?

... I don't think this will be a dramatic improvement, but if you are
interested in experimenting, you could try building a jsr166.jar after
changing ForkJoinPool line 703 from:
    static final int SQMASK       = 0x007e;   // max 64 (even) slots
to:
    static final int SQMASK       = 0x001e;   // max 16 (even) slots
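
For reference, an external submission picks its queue slot roughly like
this (pseudo-code; the real logic in externalPush is more involved):

    // Each submitting thread carries a probe (hash) value; masking it
    // with SQMASK selects an even-numbered submission queue slot, so
    // the mask bounds the number of external queues:
    //   0x007e keeps bits 1..6 -> up to 64 even slots
    //   0x001e keeps bits 1..4 -> up to 16 even slots
    int slot = threadProbe & SQMASK;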

-Doug



Re: ForkJoinPool with custom number of WorkQueues

JSR166 Concurrency mailing list
Answers inline

There are now only a few j.u.c classes that include paths allowing a
non-trivial amount of spinning, and some of these might be subject to
reconsideration in light of Loom, where spinning is rarely a good idea.

Good point: most busy-spin approaches (or Thread::onSpinWait patterns) should be avoided for this reason alone,
not to mention the recent issues with Intel and the pause intrinsic, IIRC.
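
For reference, the usual spin-then-park idle pattern looks roughly like this (illustrative sketch, not FJP's actual code):

    import java.util.concurrent.locks.LockSupport;
    import java.util.function.BooleanSupplier;

    class IdleLoop {
        // Spin briefly, hinting the CPU with onSpinWait (JDK 9+,
        // typically a pause instruction on x86), then park if no
        // work shows up; the producer unparks us when work arrives.
        static void awaitWork(BooleanSupplier hasWork) {
            for (int spins = 0; spins < 1024; spins++) {
                if (hasWork.getAsBoolean())
                    return;
                Thread.onSpinWait();
            }
            LockSupport.park();
        }
    }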

I'm not a big fan of forcing users to make such decisions, since they
can/will encounter the same issues as algorithmic solutions do, as in
Carl's situation below.

That's true for any fine tuning related to a specific performance issue, I believe.
That's why I propose not putting this tuning in the hands of the "average" (to be defined) user, but instead
defaulting to the park approach and exposing the full-fat feature (i.e. a customizable IdleStrategy/WaitStrategy
and the like) only via a protected constructor accessible through inheritance. Any defensive approach is
welcome here :)
I perfectly understand that only a few devs in the world need this, but since I'm one of them I can't help
proposing it, while respecting your need not to complicate life for the other 99.9999% of users. I do
understand if it doesn't seem safe enough.
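
Concretely, something like this (purely hypothetical, reusing the WaitStrategy sketch from before; nothing like this exists in FJP today):

    import java.util.concurrent.ForkJoinPool;

    // Hypothetical: plain pools keep the default park behavior; only
    // code that deliberately subclasses can reach the strategy-taking
    // constructor.
    class TunableForkJoinPool extends ForkJoinPool {
        protected TunableForkJoinPool(int parallelism, WaitStrategy idle) {
            super(parallelism);
            // ... hypothetical plumbing: consult 'idle' before parking
        }
    }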
 


Re: ForkJoinPool with custom number of WorkQueues

JSR166 Concurrency mailing list
Response inline:

On Sun, Nov 11, 2018 at 6:25 AM Doug Lea via Concurrency-interest <[hidden email]> wrote:

... I don't think this will be a dramatic improvement, but if you are
interested in experimenting, you could try building a jsr166.jar after
changing ForkJoinPool line 703 from:
    static final int SQMASK       = 0x007e;   // max 64 (even) slots
to:
    static final int SQMASK       = 0x001e;   // max 16 (even) slots

Thanks for the tip.  I tried this out (and 0x003e as well), and as you said, it hardly helped.  In my profile, most of the time moved from park/unpark over to scan, making it mostly a wash.

Here is what my profiler says:

32 parallelism, 16 workqueues
8.43 s 23.33% ForkJoinPool2.scan(ForkJoinPool2$WorkQueue, int) (ForkJoinPool2.java)
360 ms 0.996% ForkJoinPool2.externalPush(ForkJoinTask2) (ForkJoinPool2.java)
300 ms 0.830% ForkJoinPool2.signalWork(ForkJoinPool2$WorkQueue[], ForkJoinPool2$WorkQueue) (ForkJoinPool2.java)
270 ms 0.747% ForkJoinPool2.awaitWork(ForkJoinPool2$WorkQueue, int) (ForkJoinPool2.java)
160 ms 0.443% ForkJoinPool2$WorkQueue.runTask(ForkJoinTask2) (ForkJoinPool2.java)
140 ms 0.387% ForkJoinPool2.runWorker(ForkJoinPool2$WorkQueue) (ForkJoinPool2.java)
90 ms 0.249% ForkJoinPool2.execute(java.lang.Runnable) (ForkJoinPool2.java)
50 ms 0.138% ForkJoinTask2.setCompletion(int) (ForkJoinTask2.java)
20 ms 0.055% ForkJoinTask2.doExec (ForkJoinTask2.java)
10 ms 0.028% ForkJoinPool2.tryRelease(long, ForkJoinPool2$WorkQueue, long) (ForkJoinPool2.java)
10 ms 0.028% ForkJoinTask2$RunnableExecuteAction.exec (ForkJoinTask2.java)


32 parallelism, 32 workqueues
6.82 s 18.80% ForkJoinPool2.scan(ForkJoinPool2$WorkQueue, int) (ForkJoinPool2.java)
480 ms 1.323% ForkJoinPool2.externalPush(ForkJoinTask2) (ForkJoinPool2.java)
400 ms 1.103% ForkJoinPool2.awaitWork(ForkJoinPool2$WorkQueue, int) (ForkJoinPool2.java)
320 ms 0.882% ForkJoinPool2.signalWork(ForkJoinPool2$WorkQueue[], ForkJoinPool2$WorkQueue) (ForkJoinPool2.java)
160 ms 0.441% ForkJoinTask2.doExec (ForkJoinTask2.java)
110 ms 0.303% ForkJoinPool2$WorkQueue.runTask(ForkJoinTask2) (ForkJoinPool2.java)
50 ms 0.138% ForkJoinTask2.setCompletion(int) (ForkJoinTask2.java)
20 ms 0.055% ForkJoinPool2.tryRelease(long, ForkJoinPool2$WorkQueue, long) (ForkJoinPool2.java)
10 ms 0.028% ForkJoinPool2.runWorker(ForkJoinPool2$WorkQueue) (ForkJoinPool2.java)
10 ms 0.028% ForkJoinPool2$WorkQueue.execLocalTasks (ForkJoinPool2.java)
10 ms 0.028% ForkJoinPool2.execute(java.lang.Runnable) (ForkJoinPool2.java)


 
