Why don't FJP workers steal safe but unrelated tasks during Task.join?
Hi, this is question about the FJP implementation.
I noticed that worker threads will help only certain other tasks when they are joining a task that is currently in progress before blocking on the task being joined. I understand that this is, in part, for safety reasons - it is simply unsafe to help with certain tasks when such attempts could lead to execution deadlock (starting to work on a task that then attempts to join a task the helping thread has commenced, but not finished).
But there are unrelated tasks that would be safe to work on while waiting for a to-be-joined task to complete, yet the FJP workers do not attempt to steal them. My question is why.
In my own threadpool implementation, I made the observation that a strategy in which workers will steal any safe task instead of blocking on a to-be-joined task leads to very poor performance (and to make matters worse, creates widely varying performance even with the very same task workload in multiple runs). Specifically, if unrelated tasks are stolen, a chain of task execution/joining/helping/... occurs in which most worker threads will eventually run out of tasks to steal, resulting in a tail of serial execution to untangle these tasks. This is for a fully-strict implementation of quicksort, btw. It's a tail of serial execution because my pool does not employ compensatory threads - if it did, I'd likely see their numbers go up rapidly.
My question is: how did we learn that helping with tasks that are unrelated to the task being joined is a bad idea, or is there a way to do that? I believe it is not addressed in the implementation notes in the FJP implementation, which focus only on the safety aspect.
Re: Why don't FJP workers steal safe but unrelated tasks during Task.join?
> My question is: how did we learn that helping with tasks that are
> to the task being joined is a bad idea, or is there a way to do that?
Wait-for paths are in general shortest if joiners only try to steal
back from their stealers; this also avoids deadlock scenarios.
Which is what FJP does (and sounds like what you do as well).
The precise effects are hard to characterize in the presence of randomness,
arbitrary stalls, etc. There has not been much analytic work on this
beyond showing worst cases (wrt deadlock and/or time-space), and to
note that some issues can be evaded by using explicit completions (as in
CountedCompleter that identify tasks that can be performed next without
blocking or search) at the price of more task objects and less pleasant
programming. (These are used inside jdk8 parallel Streams.)
If you are interested in pursuing this, you might read through
the few papers discussing it, starting with the "leap-frogging"
paper in the FJP docs.