strange performance issue with ArrayBlockingQueue on Linux

strange performance issue with ArrayBlockingQueue on Linux

Matthews, Graham

[apologies in advance if this is the wrong place to discuss this issue. Also apologies for the long setup -- I promise this is about ArrayBlockingQueue, Java, and a performance issue with them.]


My student and I have been benchmarking various ways to run cross-data-source joins. The basic setup is as follows:


            J2
           /  \
         R3    J1
              /  \
            R2    R1

R1, R2, and R3 are called receive operators, as they receive data from the databases. J1 and J2 are inner join operators: J1 joins values from R2 with values from R1, and J2 joins that result with values from R3.

One way to run these queries is the classic producers-and-consumers approach: each operator is a Java thread, with bounded queues (ArrayBlockingQueues) between the operators. The example above therefore has 5 threads and 4 ArrayBlockingQueues, and I/O is interleaved with computation.
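For concreteness, the wiring looks roughly like this (a sketch with invented names and capacities -- the real Receive and Join classes are in the attachments, so this is not the literal code):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class WiringSketch {
        public static void main(String[] args) {
            /* The 4 bounded queues connecting the 5 operator threads. */
            BlockingQueue<Long> r1ToJ1 = new ArrayBlockingQueue<>(1024);
            BlockingQueue<Long> r2ToJ1 = new ArrayBlockingQueue<>(1024);
            BlockingQueue<Long> j1ToJ2 = new ArrayBlockingQueue<>(1024);
            BlockingQueue<Long> r3ToJ2 = new ArrayBlockingQueue<>(1024);
            /* One thread per operator: each Receive puts rows into its output
               queue; each Join takes from its two input queues and puts joined
               rows downstream, e.g.
               new Thread(new Receive(db1, r1ToJ1)).start(); ... and so on. */
        }
    }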

Another way to run this is to first load R3 into J2 in parallel with loading R2 into J1 (the loads build hash maps in J1 and J2), and then have a thread pool with N workers (where N is the number of cores), where each worker takes a value from R1 and joins it "up the spine" of the tree -- matching it against the hash map in J1 and then the one in J2. This approach (the phased approach) therefore has an I/O phase followed by a computation phase.
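A rough sketch of the phased variant, with the same caveats (the hash-map contents, key probing, and loading below are invented stand-ins for the real database work):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class PhasedSketch {
        public static void main(String[] args) throws InterruptedException {
            /* Phase 1 (I/O): build J1's and J2's hash maps in parallel. */
            Map<Long, Long> j1 = new HashMap<>(), j2 = new HashMap<>();
            Thread loadJ1 = new Thread(() -> load(j1));   // from R2
            Thread loadJ2 = new Thread(() -> load(j2));   // from R3
            loadJ1.start(); loadJ2.start();
            loadJ1.join(); loadJ2.join();

            /* Phase 2 (compute): N workers drain R1 and probe up the spine. */
            Queue<Long> r1 = new ConcurrentLinkedQueue<>();
            for (long i = 0; i < 1_000_000; i++) r1.add(i);
            int n = Runtime.getRuntime().availableProcessors();
            Thread[] workers = new Thread[n];
            for (int w = 0; w < n; w++) {
                workers[w] = new Thread(() -> {
                    Long v;
                    while ((v = r1.poll()) != null) {
                        Long m1 = j1.get(v);                        // probe J1
                        Long m2 = (m1 == null) ? null : j2.get(m1); // then J2
                        // if (m2 != null) emit the joined row
                    }
                });
                workers[w].start();
            }
            for (Thread t : workers) t.join();
        }

        /* Stand-in for a database load. */
        static void load(Map<Long, Long> m) {
            for (long i = 0; i < 1_000_000; i++) m.put(i, i);
        }
    }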


On OS X El Capitan (10.11.6), the queueing approach and the phased approach perform roughly equally (more variation in the run times for the queueing approach, but performance within 10% of each other). On Linux (Ubuntu 16.04 LTS) the queueing approach is almost three times slower than the phased approach!

On top of that, the queueing approach runs over twice as fast on OS X as it does on Linux, despite the fact that I am running OS X on old hardware (mid-2012 MacBook Pro, 3rd Gen Intel Core i5-3210M) and Linux on relatively new hardware (5th Gen Intel Core i7-5500U).

We have boiled the problem down to code that doesn't involve databases -- just threads and ArrayBlockingQueues. Attached are 5 Java files which "approximate" the queueing approach by having each receive thread simply output numbers (rather than fetch values from a database) to an ArrayBlockingQueue, and each Join thread simply read its two input queues, add the values together, and output the sum to its output queue.
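Since the attachments don't inline well here, a stripped-down, self-contained version of the same shape might look like this (names, counts, and queue capacity are mine, not the attached code's):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class MiniPipeline {
        static final long EOS = Long.MIN_VALUE;   // end-of-stream marker
        static final int N = 10_000_000;

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Long> lhs = new ArrayBlockingQueue<>(1024);
            BlockingQueue<Long> rhs = new ArrayBlockingQueue<>(1024);
            Thread r1 = receive(lhs), r2 = receive(rhs);
            Thread join = new Thread(() -> {           // a "J": sum pairs
                try {
                    long sink = 0;
                    for (;;) {
                        long a = lhs.take(), b = rhs.take();
                        if (a == EOS || b == EOS) break;
                        sink += a + b;   // stand-in for put() to a downstream queue
                    }
                    System.out.println(sink);
                } catch (InterruptedException ignored) {}
            });
            long start = System.nanoTime();
            r1.start(); r2.start(); join.start();
            join.join();
            System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
        }

        static Thread receive(BlockingQueue<Long> out) {  // an "R": emit numbers
            return new Thread(() -> {
                try {
                    for (long i = 0; i < N; i++) out.put(i);
                    out.put(EOS);
                } catch (InterruptedException ignored) {}
            });
        }
    }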

On the same hardware as above, this code runs in 1 minute 30 seconds on OS X and 2 minutes 13 seconds on Linux!

The code is very simple -- just a bunch of threads and a bunch of ArrayBlockingQueues. So I am at a loss to explain the sizable performance differences noted above.

Again apologies if this is not the right forum to discuss this issue.

graham


------ 
Graham MATTHEWS, Ph.D.
Assistant Professor
Department of Computer Science |  California Lutheran University
60 W. Olsen Rd.  |  Thousand Oaks, CA 91360
[hidden email]  | Office location: D19

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest

Buffer.java (268 bytes) Download Attachment
Join.java (1K) Download Attachment
Receive.java (970 bytes) Download Attachment
Stream.java (912 bytes) Download Attachment
Test.java (3K) Download Attachment

Re: strange performance issue with ArrayBlockingQueue on Linux

Martin Buchholz
One obvious variable is the version of ABQ (ArrayBlockingQueue) you are using. It was modified in JDK 9, and you should be doing all performance work with the latest JDK 9 binaries, e.g. from http://jdk.java.net/9/


Re: strange performance issue with ArrayBlockingQueue on Linux

Matthews, Graham

Ah yes, sorry -- I should have said we are working in Java 8.

graham


------ 
Graham MATTHEWS, Ph.D.
Assistant Professor
Department of Computer Science |  California Lutheran University
60 W. Olsen Rd.  |  Thousand Oaks, CA 91360
[hidden email]  | Office location: D19

From: Martin Buchholz <[hidden email]>
Sent: Wednesday, July 19, 2017 8:57 PM
To: Matthews, Graham
Cc: [hidden email]
Subject: Re: [concurrency-interest] strange performance issue with ArrayBlockingQueue on Linux
 
One obvious variable is the version of ABQ you are using.  It was modified in jdk9 and you should be doing all performance work with latest jdk9 binaries e.g. from http://jdk.java.net/9/

On Wed, Jul 19, 2017 at 8:41 PM, Matthews, Graham <[hidden email]> wrote:

​[apologies in advance if this is the wrong place to discuss this issue. also apologies for the long setup -- I promise this is about ArrayBlockingQueue, Java, and a performance issue with them]


My student and I have been benchmarking various ways to run cross data source joins. The basic setup is as follows:


           J2

           /  \

          /    \

        R3  J1

              /  \

             /    \

           R2  R1

R1, R2 and R3 are called receive operators as they receive data from the databases. J1 and J2 are inner join operators joining values from R2 with values from R1 and then the result with values from R3.

One way to run these queries is the classic producers and consumers approach -- so each operator is a Java thread, and there are bounded queues (ArrayBlockingQueues) between the operators. So the above example has 5 threads and 4 ArrayBlockingQueues. So here we interleave I/O with computation.

Another way to run this is to first load R3 into J2 in parallel with loading R2 into J1 (the loads build hash maps in J1 and J2), and then have a thread pool with N workers (where N is the number of cores), where each worker takes a value from R1 and joins it "up the spine" of the tree -- so matches against the hash map in J1 and then J2. This approach (the phased approach) therefore has an I/O phase, followed by a computation phase.


On OS X El Capitan (10.11.6), the queueing approach and the phased approach perform roughly equally (more variation in the run times for the queueing approach, but performance within 10% of each other). On Linux (Ubuntu 16.04 LTS) the queueing approach is almost three times slower than the phased approach!

On top of that the queueing approach runs over twice as fast on OS X as it is does on Linux, despite the fact that I am running OS X on old hardware (mid 2012 Macbook Pro, 3rd Gen Intel Core i5 3210M) and Linux on relatively new hardware (6th Gen Intel Core i7-5500U).

We have boiled the problem down to code that doesn't involve databases -- just threads and ArrayBlockingQueues. Attached are 5 Java files which "approximate" the queueing approach by having each receive thread simply output numbers (rather than get values from a database) to an ArrayBlockingQueue, and each Join thread simply read its two input queues, adds the values together, and then output the sum to the output queue.

On the same hardware above this code runs in 1 minute 30 secs on OS X, and 2 minutes 13 secs on Linux!

The code is very simple -- just a bunch of threads and a bunch of ArrayBlockingQueues. So I am at a loss to explain the size-able performance differences noted above.

Again apologies if this is not the right forum to discuss this issue.

graham


------ 
Graham MATTHEWS, Ph.D.
Assistant Professor
Department of Computer Science |  California Lutheran University
60 W. Olsen Rd.  |  Thousand Oaks, CA 91360
[hidden email]  | Office location: D19

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest



_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: strange performance issue with ArrayBlockingQueue on Linux

Brian S O'Neill
You also didn't mention the CPU core count. I ran the test on two machines, with Java 8 and with Java 9. With Java 9 I explicitly selected the parallel GC (-XX:+UseParallelGC), because the G1 collector in Java 9 is a CPU pig that slows everything down.

Machine 1: Windows 10, Intel Xeon E3-1245 3.3 GHz, 8 logical cores
Machine 2: Linux 4.10, AMD Ryzen 7 1700 3.0 GHz, 16 logical cores

Machine 1, Java 8: 9.2 seconds
Machine 1, Java 9: 9.1 seconds
Machine 2, Java 8: 16.9 seconds
Machine 2, Java 9: 12.1 seconds

At first blush it looks like the problem is Linux, but I don't have a
proper apples-to-apples comparison. I noticed that with Java 8, the
number of context switches per second was about 2.5 million, and with
Java 9 it was about 2.3 million. Either way, this is really high. On
Windows, the context switch rate was about 1.9 million per second.

I also see a lot of variation between test runs: when I run the code in a loop, the performance on Windows varies from 9 to 14 seconds per run, and on Linux from 11 to 21 seconds. I also added an explicit System.gc() between test runs.

When I run with -Xprof, I see the threads spending most of their time in
Unsafe.unpark and Unsafe.park. I've never understood why unpark always
comes up as a performance bottleneck. What does it do that's so expensive?



Re: strange performance issue with ArrayBlockingQueue on Linux

Martin Buchholz
I also play the benchmark game and like to produce many variations, in the style of


One obvious thing to try is different BlockingQueue implementations (though it sounds like the key to performance here is the frequency and overhead of context switches).
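Since the operators only depend on the BlockingQueue interface, swapping implementations is a small, mechanical change -- something like this (a sketch; the alternatives named are standard java.util.concurrent classes, but whether any of them helps here is exactly the open question):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.LinkedTransferQueue;
    import java.util.concurrent.SynchronousQueue;

    public class QueueVariants {
        /* Factory so the rest of the test stays identical for each variant. */
        static BlockingQueue<Long> newQueue(String kind, int capacity) {
            switch (kind) {
                case "abq": return new ArrayBlockingQueue<>(capacity);   // baseline: one lock
                case "lbq": return new LinkedBlockingQueue<>(capacity);  // separate put/take locks
                case "ltq": return new LinkedTransferQueue<>();          // unbounded, CAS-based
                case "sq":  return new SynchronousQueue<>();             // zero-capacity handoff
                default: throw new IllegalArgumentException(kind);
            }
        }
    }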



Re: strange performance issue with ArrayBlockingQueue on Linux

Nathan and Ila Reynolds
In reply to this post by Brian S O'Neill
> I've never understood why unpark always comes up as a performance bottleneck. What does it do that's so expensive?

A while ago, I talked with the engineer who has done a lot of optimization on LockSupport.unpark(). I can't remember whether that work also covers Unsafe.unpark().

On Windows, LockSupport.unpark() goes into the kernel once, to set a Windows Event object. On Linux, LockSupport.unpark() acquires a lock, signals a condition variable, and releases the lock. This *might* require 3 trips into the kernel.

Furthermore, the scheduler might let the unparked thread start running and make the unparking thread yield (if no logical cores are available). In that case, unpark() will take even more time.
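One way to see the handoff cost in isolation, without any queues, is a two-thread park/unpark ping-pong -- a rough sketch, with the usual caveats about hand-rolled micro-benchmarks (it also ignores spurious wakeups for brevity):

    import java.util.concurrent.locks.LockSupport;

    public class ParkPingPong {
        static volatile Thread pinger, ponger;

        public static void main(String[] args) throws InterruptedException {
            final int rounds = 1_000_000;
            ponger = new Thread(() -> {
                for (int i = 0; i < rounds; i++) {
                    LockSupport.park();            // wait for the pinger
                    LockSupport.unpark(pinger);    // wake it back up
                }
            });
            pinger = Thread.currentThread();
            ponger.start();
            long start = System.nanoTime();
            for (int i = 0; i < rounds; i++) {
                LockSupport.unpark(ponger);        // wake the ponger
                LockSupport.park();                // wait for its reply
            }
            long ns = System.nanoTime() - start;
            System.out.println(ns / rounds + " ns per round trip");
        }
    }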

-Nathan


Re: strange performance issue with ArrayBlockingQueue on Linux

Brian S O'Neill
In this particular test, the number of active threads was less than or equal to the number of logical cores, so any cost due to unpark possibly triggering a yield shouldn't apply. My understanding is that park/unpark on Linux relies on a futex, so perhaps this is just one kernel call.


Re: strange performance issue with ArrayBlockingQueue on Linux

Alex Otenko
In reply to this post by Nathan and Ila Reynolds
This particular case is not about the cost of a context switch, and not about the number of context switches per second. This is about the amount of work done per switch.


Consider J2. J2 moving data unblocks J1 and J3. If J1 gets on the CPU, the context switch allows J1 to move just one data item before it blocks again. If J3 gets on the CPU instead, J2 gets to move as much data as possible - essentially emptying J1's buffer - so by the time J2 blocks, it has created a lot of work for both J3 (it kept feeding J3's input queue) and J1 (it emptied J1's output queue). When J1 does get on a core again, it can move a lot of items upstream to J2 (which in turn does better if it waits until a lot of work is available, instead of getting agitated the moment a single item is in its queue).

An example change in Join.java (synthetically biasing who gets blocked when) dramatically changed the times on my Ubuntu box (in VirtualBox):

    @Override
    public void run() {
        /* Batch buffer: accumulate joined values locally and flush them in
           bulk, so each wakeup of the downstream consumer finds many items. */
        long [] arr = new long[1000];
        int pos = 0;
        try {
            while (true) {
                long lhsValue = this.lhsInput.take();
                long rhsValue = this.rhsInput.take();

                /* Check for end-of-stream markers. */
                if (lhsValue == Stream.EOS) {
                    this.eosCount++;
                }
                if (rhsValue == Stream.EOS) {
                    this.eosCount++;
                }

                /* Join values or send end-of-stream marker. */
                if (this.eosCount == 2) {
                    /* Flush whatever is still buffered before finishing. */
                    for (int i = 0; i < pos; i++) {
                      this.put(arr[i]);
                    }
                    break;
                }
                else {
                    arr[pos++] = lhsValue + rhsValue;
                    if (pos == arr.length) {
                      /* Buffer full: flush the whole batch downstream. */
                      for (int i = 0; i < pos; i++) {
                        this.put(arr[i]);
                      }
                      pos = 0;
                    }
                }
            }
            this.put(Stream.EOS);
            System.out.println(this + " : END");
        }
        catch (InterruptedException ex) {
            ex.printStackTrace(System.err);
        }
    }


Alex



Re: strange performance issue with ArrayBlockingQueue on Linux

Alex Otenko
Or, to make the claim more obvious (I reduced the number of items to move):

Ubuntu:
JOIN 1 : END - avg moved = 15.522080166396762 stops: 193273
JOIN 3 : END - avg moved = 11.284180712332459 stops: 265859
JOIN 4 : END - avg moved = 5.261479667895535 stops: 570182
JOIN 6 : END - avg moved = 5.253905401713407 stops: 571004
Num results = 1000000
JOIN 5 : END - avg moved = 9.410645946522266 stops: 318788
JOIN 2 : END - avg moved = 4.980204145797066 stops: 594771
JOIN 7 : END - avg moved = 13.271463266813882 stops: 226049

Mac:
JOIN 2 : END - avg moved = 191.25074912336663 stops: 15685
JOIN 5 : END - avg moved = 194.36389554850084 stops: 15433
JOIN 6 : END - avg moved = 270.7480144404339 stops: 11080
JOIN 7 : END - avg moved = 259.14675168653486 stops: 10821
JOIN 1 : END - avg moved = 231.73393077873897 stops: 12944
JOIN 3 : END - avg moved = 171.97678284797118 stops: 17444
JOIN 4 : END - avg moved = 216.7288490715988 stops: 13841


For Join.java, counting how many times take or put would have blocked (a failed non-blocking poll() or offer() counts as a "stop"):
    @Override
    public void run() {
        int moved = 0;
        double avg = 0;
        int stops = 0;
        try {
            while (true) {
                Long v = this.lhsInput.poll();
                if (v == null) {
                  avg = (avg * stops + moved) / (stops + 1);
                  stops++;
                  moved = 0;
                  v = this.lhsInput.take();
                }
                moved++;

                long lhsValue = v;

                v = this.rhsInput.poll();
                if (v == null) {
                  avg = (avg * stops + moved) / (stops + 1);
                  stops++;
                  moved = 0;
                  v = this.rhsInput.take();
                }
                moved++;

                long rhsValue = v;

                /* Check for end-of-stream markers. */
                if (lhsValue == Stream.EOS) {
                    this.eosCount++;
                }
                if (rhsValue == Stream.EOS) {
                    this.eosCount++;
                }

                /* Join values or send end-of-stream marker.*/
                if (this.eosCount == 2) {
                    break;
                }
                else {
                    if (!this.offer(lhsValue + rhsValue)) {
                      avg = (avg * stops + moved) / (stops + 1);
                      stops++;
                      moved = 0;
                      this.put(lhsValue + rhsValue);
                    }
                    moved++;
                }
            }
            this.put(Stream.EOS);
            System.out.println(this + " : END - avg moved = " + avg + " stops: " + stops);
        }
        catch (InterruptedException ex) {
            ex.printStackTrace(System.err);
        }
    }

(and a corresponding change to Stream.java to expose offer())
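For reference, that Stream.java change is presumably just a non-blocking sibling of the existing put(), delegating to the underlying queue -- a guess at its shape (the field name 'output' is hypothetical):

    /* Hypothetical addition to Stream.java: a non-blocking put().
       Returns false instead of blocking when the queue is full. */
    public boolean offer(long value) {
        return this.output.offer(value);
    }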

Alex


Re: strange performance issue with ArrayBlockingQueue on Linux

Matthews, Graham
In reply to this post by Brian S O'Neill
The core count on both machines is the same -- dual core with hyper-threading, so 4 logical "cores".

I have repeated the test on the Mac with JDK 9 and get the same performance figures (1 minute 30 seconds).

I have also checked the context-switch rate: OS X runs the code at about 50-100K context switches per second (so an order of magnitude fewer than you report).

We have been unable to get JDK 9 running on our Linux box, but are working on it. I will post when I have the results for our Linux box.

We had previously run this code with profiling and, like you, found the unpark/park issue. We had also noticed that on OS X almost half as much time is spent in park/unpark as on Linux.


________________________________________
From: Concurrency-interest <[hidden email]> on behalf of Brian S O'Neill <[hidden email]>
Sent: Wednesday, July 19, 2017 9:50 PM
To: [hidden email]
Subject: Re: [concurrency-interest] strange performance issue with ArrayBlockingQueue on Linux

You also didn't mention CPU core count. I ran the test on two machines,
with Java 8 and with Java 9. With Java 9 I explicitly selected the
parallel GC, because the G1 collector in Java 9 is a CPU pig that slows
everything down.

Machine 1: Windows 10, Intel Xeon E31245 3.3GHz, 8 logical cores
Machine 2: Linux 4.10, AMD Ryzen 7 1700 3.0Ghz, 16 logical cores

Machine 1, Java 8: 9.2 seconds
Machine 1, Java 9: 9.1 seconds
Machine 2, Java 8: 16.9 seconds
Machine 2, Java 9: 12.1 seconds

At first blush it looks like the problem is Linux, but I don't have a
proper apples-to-apples comparison. I noticed that with Java 8, the
number of context switches per second was about 2.5 million, and with
Java 9 it was about 2.3 million. Either way, this is really high. On
Windows, the context switch rate was about 1.9 million per second.

I also see a lot of variation between test runs, and when I run the code
in a loop, the performance (on Windows) varies from 9 to 14 seconds per
run. On Linux, I see 11 to 21 seconds when running in a loop. I also
added an explicit System.gc between each test run.
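The harness can be as simple as the following sketch (ours, not one of the attached files); pin the collector as described above with -XX:+UseParallelGC, and drop the benchmark body in where indicated:

    // Repeat-run harness: run the test in a loop with an explicit
    // System.gc() between runs so one run's garbage doesn't bleed into
    // the next run's measurements.
    public class Harness {
        static void benchmark(Runnable test, int runs) {
            for (int run = 1; run <= runs; run++) {
                System.gc();                              // settle the heap
                long t0 = System.nanoTime();
                test.run();
                System.out.printf("run %d: %.1f s%n",
                                  run, (System.nanoTime() - t0) / 1e9);
            }
        }

        public static void main(String[] args) {
            benchmark(() -> { /* hypothetical: start the attached test here */ }, 10);
        }
    }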

When I run with -Xprof, I see the threads spending most of their time in
Unsafe.unpark and Unsafe.park. I've never understood why unpark always
comes up as a performance bottleneck. What does it do that's so expensive?


On 2017-07-19 09:11 PM, Matthews, Graham wrote:

> ​Ah yes, sorry I should have said we are working in Java 8.
>
> graham

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: strange performance issue with ArrayBlockingQueue on Linux

Martin Buchholz-3


On Fri, Jul 21, 2017 at 11:29 AM, Matthews, Graham <[hidden email]> wrote:

We had previously run this code under a profiler and, like you, found the unpark/park issue. We had also noticed that OS X spends almost half as much time in park/unpark as Linux does.

Through experience I am distrustful of profilers, but I am especially distrustful of any measurements around park/unpark.  Profilers are often deeply confused by thread suspension. 

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: strange performance issue with ArrayBlockingQueue on Linux

Matthews, Graham
In reply to this post by Nathan and Ila Reynolds
I would like to dig further into this issue, as I am studying whether the traditional "threads and queues" approach is really slower than more modern approaches (like SEDA), or whether its supposed slowness is simply an artifact of the underlying threading implementation.

The results on OS X, where the traditional and modern approaches perform similarly, are food for thought in this respect.

Any suggestions as to whom to contact next to understand/improve the Linux performance would be most welcome.

graham

PS: I have confirmed that switching to JDK 9 doesn't change anything. If anything, the code is even slower on JDK 9 than on 8.

------
Graham MATTHEWS, Ph.D.
Assistant Professor
Department of Computer Science |  California Lutheran University
60 W. Olsen Rd.  |  Thousand Oaks, CA 91360
[hidden email]  | Office location: D19

________________________________________
From: Concurrency-interest <[hidden email]> on behalf of Nathan and Ila Reynolds <[hidden email]>
Sent: Thursday, July 20, 2017 7:45 AM
To: [hidden email]
Subject: Re: [concurrency-interest] strange performance issue with ArrayBlockingQueue on Linux

> I've never understood why unpark always comes up as a performance
> bottleneck. What does it do that's so expensive?

A while ago, I talked with the engineer who has done a lot of
optimization work on LockSupport.unpark().  I can't remember whether
that work also covers Unsafe.unpark().

On Windows, LockSupport.unpark() goes into the kernel once, to set a
Windows Event object.  On Linux, LockSupport.unpark() acquires a lock,
signals a condition variable, and releases the lock.  This *might*
require 3 trips into the kernel.

Furthermore, the scheduler might let the unparked thread start running
and the unparking thread yield (if no logical cores are available).
Hence, unpark() will take even more time.

-Nathan

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: strange performance issue with ArrayBlockingQueue on Linux

Alex Otenko
Are you studying scheduler performance for complex dependency graphs?

For example, when I look at my Linux behaviour, I see that the J* threads move only a few items at a time before blocking, whereas the Mac is able to move an order of magnitude more (and, in my case, completes an order of magnitude faster -- your mileage may vary, but maybe that's also linked to how many items the threads move before blocking).

Solving this problem is not easy, because an optimal scheduler would need to know not to schedule a join thread too soon -- that, in a longer pipeline J1 -> J2 -> J3 -> ..., it is more profitable to schedule J1 only once J2 has freed plenty of space in the queue between them (and in the meantime use the CPU to run something else), and to schedule J3 only once J2 has filled up the queue between them -- so that J3 moves many items to J4, J4 moves many items to J5, and so on, instead of spending one context switch per moved item. So the cost of a context switch is not so much the problem as scheduling the threads *late* enough.
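
A user-level approximation of "move many items per wakeup" -- our sketch, not something the attached test does -- is to have each consumer drain in batches via BlockingQueue.drainTo, so each park/unpark is amortized over a whole batch:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;

    // Consumer loop that blocks for one element, then drains whatever else
    // is already queued, capping handoffs at one per batch rather than one
    // per element. (Termination handling omitted for brevity.)
    class BatchingConsumer {
        static void consume(BlockingQueue<Integer> in,
                            BlockingQueue<Integer> out)
                throws InterruptedException {
            List<Integer> batch = new ArrayList<>(256);
            while (true) {
                batch.add(in.take());        // block until something arrives
                in.drainTo(batch, 255);      // grab up to 255 more, non-blocking
                for (int v : batch)
                    out.put(v + 1);          // stand-in for the join work
                batch.clear();
            }
        }
    }

This doesn't steer the scheduler, but it does bound the number of wakeups at one per batch, which is roughly what the Mac scheduler appears to be achieving for free.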

Steering scheduling here would be an interesting problem.


Alex

> On 25 Jul 2017, at 01:03, Matthews, Graham <[hidden email]> wrote:
>
> I would like to dig further into this issue, as I am studying whether the traditional "threads and queues" approach is really slower than more modern approaches (like SEDA), or whether its supposed slowness is simply an artifact of the underlying threading implementation.
>
> The results on OS X, where the traditional and modern approaches perform similarly, are food for thought in this respect.
>
> Any suggestions as to whom to contact next to understand/improve the Linux performance would be most welcome.
>
> graham
>
> PS: I have confirmed that switching to JDK 9 doesn't change anything. If anything, the code is even slower on JDK 9 than on 8.
>

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest