(hypothetical? bogus?) TSAN failure reported in ForkJoinPool (JDK11 and "refresh" versions)

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

(hypothetical? bogus?) TSAN failure reported in ForkJoinPool (JDK11 and "refresh" versions)

JSR166 Concurrency mailing list
Hi,

Lately we've been running most of our continuous builds with TSAN. When we updated to the JDK11 runtime, we started getting reports of a race when handing a task to a ForkJoinPool. I see the same race when running with the "refresh" ForkJoinPool (assuming that's what I still get when I download http://gee.cs.oswego.edu/dl/concurrent/dist/jsr166.jar nowadays).

The very rough form of the code is executing a job like this:

class Job implements Runnable {
  boolean b = true;

  public void run() {
    if (!b) { ... }
  }
}

TSAN reports a race between the write of b (in the submitting thread) and the read (in the pool thread).

After staring at ForkJoinPool code a little, I'm not completely clear on whether there's necessarily a happens-before edge between the plain write of the task and the acquire read, at least when the worker was already running beforehand. I see a subsequent casSlotToNull, but that's a weakCompareAndSet, so I don't think that helps?

But there is so much going on in ForkJoinPool that I won't pretend to understand -- some of which TSAN doesn't currently understand, either (e.g., manual fences). Plus, even if there is theoretically a race here, I don't know if any compiler/architecture would exploit it. (For what it's worth, I couldn't actually observe a stale value in my limited testing, even on PPC.)

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: (hypothetical? bogus?) TSAN failure reported in ForkJoinPool (JDK11 and "refresh" versions)

JSR166 Concurrency mailing list
Additional info:

We have worked around the immediate TSAN failure.

Unfortunately, I don't have a test case packed up for you to run. What I have is entangled with Google infrastructure (which is different from some JDK infrastructure Martin discussed last year).

But here is the code I've been running and a TSAN failure report. (TSAN gives an error maybe about half the time I run the test.) The report comes from a run against the "refresh" ForkJoinPool.

Again, the failure is a TSAN failure only: I haven't been able to see a stale value in practice, so my assertFalse check is always succeeding.

package com.google.common;

import static java.util.concurrent.TimeUnit.SECONDS;
import static org.junit.Assert.assertFalse;

import java.util.concurrent.ForkJoinPool;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;

@RunWith(JUnit4.class)
public class ForkJoinPoolTsanTest {
  @Test
  public void testHandoff() throws Exception {
    ForkJoinPool service = new ForkJoinPool();
    boolean[] failed = new boolean[1];
    for (int j = 0; j < 5; j++) {
      service.execute(new Job(failed));
    }
    service.shutdown();
    service.awaitTermination(10, SECONDS);
    assertFalse(failed[0]);
  }

  private static final class Job implements Runnable {
    final boolean[] failed;
    boolean b = true;

    Job(boolean[] failed) {
      this.failed = failed;
    }

    @Override
    public void run() {
      if (!b) {
        failed[0] = true;
      }
    }
  }
}

Read of size 1 at 0x0000cf6f8d54 by thread T18:
  #0 com.google.common.ForkJoinPoolTsanTest$Job.run()V ForkJoinPoolTsanTest.java:35
  #1 java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec()Z ForkJoinTask.java:1341
  #2 java.util.concurrent.ForkJoinTask.doExec()I ForkJoinTask.java:435
  #3 java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V ForkJoinPool.java:1128
  #4 java.util.concurrent.ForkJoinPool.scan(Ljava/util/concurrent/ForkJoinPool$WorkQueue;II)I ForkJoinPool.java:1598
  #5 java.util.concurrent.ForkJoinPool.runWorker(Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V ForkJoinPool.java:1565
  #6 java.util.concurrent.ForkJoinWorkerThread.run()V ForkJoinWorkerThread.java:138

Previous write of size 1 at 0x0000cf6f8d54 by thread T5:
  #0 com.google.common.ForkJoinPoolTsanTest$Job.<init>([Z)V ForkJoinPoolTsanTest.java:27
  #1 com.google.common.ForkJoinPoolTsanTest.testHandoff()V ForkJoinPoolTsanTest.java:18

Thread T18 (tid=661, running) created by thread T5 at:
  #0 pthread_create third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:965:3 (4766092b7397d478e6abfb6f139c4adc84ae7e74c64c75064c31eca5ef3d4346_02000308b290+0x36b82a)
  #1 os::create_thread(Thread*, os::ThreadType, unsigned long) <null> (libjvm.so+0x1148e95)
  #2 java.lang.Thread.start()V Thread.java:803
  #3 java.util.concurrent.ForkJoinPool.createWorker()Z ForkJoinPool.java:1408
  #4 java.util.concurrent.ForkJoinPool.signalWork()V ForkJoinPool.java:1531
  #5 java.util.concurrent.ForkJoinPool.externalPush(Ljava/util/concurrent/ForkJoinTask;)V ForkJoinPool.java:2128
  #6 java.util.concurrent.ForkJoinPool.externalSubmit(Ljava/util/concurrent/ForkJoinTask;)Ljava/util/concurrent/ForkJoinTask; ForkJoinPool.java:2143
  #7 java.util.concurrent.ForkJoinPool.execute(Ljava/lang/Runnable;)V ForkJoinPool.java:2587
  #8 com.google.common.ForkJoinPoolTsanTest.testHandoff()V ForkJoinPoolTsanTest.java:18

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: (hypothetical? bogus?) TSAN failure reported in ForkJoinPool (JDK11 and "refresh" versions)

JSR166 Concurrency mailing list
In JDK 8 ForkJoinPool used getObjectVolatile to get tasks from the work queue array. 
See the method scan(WorkQueue w, int r) from the class ForkJoinPool:
U.getObjectVolatile(a, i))) != null &&

In JDK 11 ForkJoinPool uses getAcquire to get a task from the array
t = (ForkJoinTask<?>)QA.getAcquire(a, k = (cap - 1) & b);


On 16 July 2020 20:15 Chris Povirk via Concurrency-interest <[hidden email]> wrote:


Additional info:

We have  worked around the immediate TSAN failure.

Unfortunately, I don't have a test case packed up for you to run. What I have is entangled with Google infrastructure (which is different from some JDK infrastructure Martin discussed last year).

But here is the code I've been running and a TSAN failure report. (TSAN gives an error maybe about half the time I run the test.) The report comes from a run against the "refresh" ForkJoinPool.

Again, the failure is a TSAN failure only: I haven't been able to see a stale value in practice, so my assertFalse check is always succeeding.

package com.google.common;

import static java.util.concurrent.TimeUnit.SECONDS;
import static org.junit.Assert.assertFalse;

import java.util.concurrent.ForkJoinPool;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;

@RunWith(JUnit4.class)
public class ForkJoinPoolTsanTest {
  @Test
  public void testHandoff() throws Exception {
    ForkJoinPool service = new ForkJoinPool();
    boolean[] failed = new boolean[1];
    for (int j = 0; j < 5; j++) {
      service.execute(new Job(failed));
    }
    service.shutdown();
    service.awaitTermination(10, SECONDS);
    assertFalse(failed[0]);
  }

  private static final class Job implements Runnable {
    final boolean[] failed;
    boolean b = true;

    Job(boolean[] failed) {
      this.failed = failed;
    }

    @Override
    public void run() {
      if (!b) {
        failed[0] = true;
      }
    }
  }
}

Read of size 1 at 0x0000cf6f8d54 by thread T18:
  #0 com.google.common.ForkJoinPoolTsanTest$Job.run()V ForkJoinPoolTsanTest.java:35
  #1 java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec()Z ForkJoinTask.java:1341
  #2 java.util.concurrent.ForkJoinTask.doExec()I ForkJoinTask.java:435
  #3 java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Ljava/util/concurrent/ForkJoinTask;Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V ForkJoinPool.java:1128
  #4 java.util.concurrent.ForkJoinPool.scan(Ljava/util/concurrent/ForkJoinPool$WorkQueue;II)I ForkJoinPool.java:1598
  #5 java.util.concurrent.ForkJoinPool.runWorker(Ljava/util/concurrent/ForkJoinPool$WorkQueue;)V ForkJoinPool.java:1565
  #6 java.util.concurrent.ForkJoinWorkerThread.run()V ForkJoinWorkerThread.java:138

Previous write of size 1 at 0x0000cf6f8d54 by thread T5:
  #0 com.google.common.ForkJoinPoolTsanTest$Job.<init>([Z)V ForkJoinPoolTsanTest.java:27
  #1 com.google.common.ForkJoinPoolTsanTest.testHandoff()V ForkJoinPoolTsanTest.java:18

Thread T18 (tid=661, running) created by thread T5 at:
  #0 pthread_create third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:965:3 (4766092b7397d478e6abfb6f139c4adc84ae7e74c64c75064c31eca5ef3d4346_02000308b290+0x36b82a)
  #1 os::create_thread(Thread*, os::ThreadType, unsigned long) <null> (libjvm.so+0x1148e95)
  #2 java.lang.Thread.start()V Thread.java:803
  #3 java.util.concurrent.ForkJoinPool.createWorker()Z ForkJoinPool.java:1408
  #4 java.util.concurrent.ForkJoinPool.signalWork()V ForkJoinPool.java:1531
  #5 java.util.concurrent.ForkJoinPool.externalPush(Ljava/util/concurrent/ForkJoinTask;)V ForkJoinPool.java:2128
  #6 java.util.concurrent.ForkJoinPool.externalSubmit(Ljava/util/concurrent/ForkJoinTask;)Ljava/util/concurrent/ForkJoinTask; ForkJoinPool.java:2143
  #7 java.util.concurrent.ForkJoinPool.execute(Ljava/lang/Runnable;)V ForkJoinPool.java:2587
  #8 com.google.common.ForkJoinPoolTsanTest.testHandoff()V ForkJoinPoolTsanTest.java:18
_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: (hypothetical? bogus?) TSAN failure reported in ForkJoinPool (JDK11 and "refresh" versions)

JSR166 Concurrency mailing list
Thanks. My impression is that even a <setRelease,getAcquire> pair is enough to establish the happens-before relationship. (I say that mainly from skimming http://gee.cs.oswego.edu/dl/html/j9mm.html, not from any deep understanding. I see that that doc even recommends against the term "happens-before" :)) And for what it's worth, that's enough to satisfy our TSAN setup, too.

The most interesting change relative to JDK8 (which is what we were using before) might be on the write side: Where JDK8 always(?) used putOrderedObject (apparently similar to volatile) or similar, JDK11 sometimes uses only(?) that plain write I linked above.

It occurs to me only now that I'm writing this that, thanks to the power of --patch-module, I could probably tweak the ForkJoinPool source to use a "stronger" write and see if the TSAN failure goes away. Maybe I'll try that if I get a chance....

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: (hypothetical? bogus?) TSAN failure reported in ForkJoinPool (JDK11 and "refresh" versions)

JSR166 Concurrency mailing list
I built jsr166 from CVS head. I can reproduce the TSAN failure with that.

If I edit lockedPush to use setRelease instead of a plain write, the failure goes away.

(Oddly, the failure also goes away if I merely switch to setOpaque, which I would not have thought would be enough. This may well be more of an artifact of what our TSAN can detect. (And again, this whole failure may well be an artifact of TSAN.))

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: (hypothetical? bogus?) TSAN failure reported in ForkJoinPool (JDK11 and "refresh" versions)

JSR166 Concurrency mailing list


On 7/17/20 10:55 AM, Chris Povirk via Concurrency-interest wrote:
I built jsr166 from CVS head. I can reproduce the TSAN failure with that.

If I edit lockedPush to use setRelease instead of a plain write, the failure goes away.

(Oddly, the failure also goes away if I merely switch to setOpaque, which I would not have thought would be enough. This may well be more of an artifact of what our TSAN can detect. (And again, this whole failure may well be an artifact of TSAN.))

Yes, this does seem to be a TSAN limitation that would be hard to address.  We could change the write to Opaque mode to be clearer about it eventually occurring, but subsequent release or volatile accesses that are guaranteed to follow eventually ensure it anyway (while also allowing a tiny bit more parallelism here).

-Doug




_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: (hypothetical? bogus?) TSAN failure reported in ForkJoinPool (JDK11 and "refresh" versions)

JSR166 Concurrency mailing list
Thanks.

The puzzling thing to me has been the "guaranteed to follow eventually" part: As best I can tell, a worker thread can potentially see the submitting thread's plain write without any subsequent action by the submitting thread. (Some evidence: I edited ForkJoinPool so that it will randomly decide to loop infinitely immediately after the write sometimes. With that change, I sometimes still see a worker thread pick up the submitted task during the submitter's infinite loop. Or at least I think that's what I'm seeing.) Thus, we can't be relying on subsequent work by the submitting thread.

So, to provide the edge we need, I gathered that the worker thread has to do something. And I see now that I was misunderstanding its weakCompareAndSet: I had read the docs on (JDK8) AtomicReference.weakCompareAndSet, which provides no ordering guarantees. But I see that that name has been deprecated as misleading, and the VarHandle weakCompareAndSet (the one that's used in ForkJoinPoo.scandoes provide ordering guarantees. Then, while I don't know the details of mixing a plain write with an acquire read + volatile CAS, I can imagine that any writes prior to the plain write in the submitting thread are guaranteed visible to the worker thread after its successful CAS.

However, this still feels a little iffy: Since the submitter is performing only plain writes, it's not obvious to me what would prevent it from writing the task to the queue (and then being delayed arbitrarily long) before writing the boolean flag. Perhaps the worker's weakCompareAndSet somehow fails "spuriously" in that case?

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: (hypothetical? bogus?) TSAN failure reported in ForkJoinPool (JDK11 and "refresh" versions)

JSR166 Concurrency mailing list

This now looks like a simpler issue than I was thinking...

On 7/17/20 2:19 PM, Chris Povirk wrote:

So, to provide the edge we need, I gathered that the worker thread has to do something. And I see now that I was misunderstanding its weakCompareAndSet: I had read the docs on (JDK8) AtomicReference.weakCompareAndSet, which provides no ordering guarantees. But I see that that name has been deprecated as misleading, and the VarHandle weakCompareAndSet (the one that's used in ForkJoinPoo.scandoes provide ordering guarantees.

Yes, prior to JDK9, the only supported weakCompareAndSet method was weak in both the sense of spurious failures and  ordering. But there are now multiple versions, and the one used here has ordering equivalent to compareAndSet on success. (The package-level docs were changed accordingly, to reference VarHandle docs.) So externalPush has memory semantics of lock-unlock.

-Doug



_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Reply | Threaded
Open this post in threaded view
|

Re: (hypothetical? bogus?) TSAN failure reported in ForkJoinPool (JDK11 and "refresh" versions)

JSR166 Concurrency mailing list
Thanks. At this point, I am definitely in over my head :) If anything comes of internal discussions about TSAN, I will report back.

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest