juc and GC (plus FJ updates)

JSR166 Concurrency mailing list

As nicely pointed out recently by Nitsan Wakart (with contributions
from Aleksey Shipilev)
(http://psy-lob-saw.blogspot.com/2018/01/what-difference-jvm-makes.html),
performance of concurrent data structures can vary a lot across
garbage collectors, of which there are at least nine (ParallelGC,
ConcMarkSweepGC, G1, Shenandoah, ZGC, Zing, IBM-J9 GC, Android, among
others), each with different low-level implementations across
processors (X86, ARM, etc), and each with many tuning options.
Different properties of collectors interact with concurrency: write
barriers may add ordering and low-level fencing; read barriers may
widen CAS windows; GC pausing interacts with blocking for
synchronization; safepoints interact with progress; and placing and
moving objects interact with cache pollution and thread-to-memory
affinity.  We'd like j.u.c components to work reasonably under
different collectors (and different JVMs), which is challenging if
not impossible in general, but it is worth the effort to do the best
we can for some of the most commonly used classes like
ConcurrentHashMap and ForkJoin.  (Some others contain too many
user-selected options to deal with systematically; for example,
ThreadPoolExecutor allows any kind of Queue to be used.)

One precaution is to cope with the fact that nearly all collectors
use card marks, which amplify false-sharing effects by creating
memory contention for reference writes that happen to be nearby in
memory even if not on the same cache line.  The performance impact is
just as bad as when multiple threads repeatedly write the same
location (8X slowdowns are not rare).  GCs also tend to move elements
of linked data structures closer together after a garbage collection,
which sometimes causes programs to get slower over time.  For
array-based collections, one workaround is to over-allocate such that
different arrays frequently written by different threads are far
enough apart; the target is for the card-table bytes marking
addresses from different arrays to fall on different cache lines.
Empirically, 8 Kbytes of separation works across at least most
collectors.  Under ParallelGC, you can band-aid this with
-XX:+UseCondCardMark, which reduces most cases of write-write
contention to read-write, leading to less cache traffic.  Even though
FJ guards work-stealing arrays by over-allocating (and
normally-populated ConcurrentHashMaps intrinsically do so for their
arrays), UseCondCardMark usually improves performance by reducing the
impact of accessing bookkeeping fields of other objects, such as
already-executed FJ tasks.  (It doesn't always improve matters,
because of extra per-write overhead that might not have been needed;
but even then the impact is small.)  Collectors that don't have this
option are more prone to occasional mysterious big slowdowns.
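
As a rough illustration of the over-allocation workaround (this is
not the jsr166 code; the class and sizes below are made up), one
might pad per-thread arrays so that the regions actually written are
roughly 8 Kbytes apart, keeping their card-table bytes well
separated:

    // Illustrative sketch only, not the FJ/CHM implementation.  The
    // padding amount mirrors the ~8 Kbytes of separation mentioned
    // above, assuming roughly 8 bytes per reference slot.
    final class PaddedBuffer {
        static final int PAD_SLOTS = 8192 / 8;   // ~8K bytes of slack

        final Object[] slots;

        PaddedBuffer(int capacity) {
            // over-allocate: the leading and trailing pad regions are
            // never written, so hot writes by different threads land
            // far apart in memory (and in the card table)
            this.slots = new Object[PAD_SLOTS + capacity + PAD_SLOTS];
        }

        void put(int i, Object x) { slots[PAD_SLOTS + i] = x; }

        Object get(int i)         { return slots[PAD_SLOTS + i]; }
    }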

Considering other GC mechanics sometimes reduces other negative
impacts.  For example, if a reference write might sometimes entail a
memory fence anyway, then you might prefer basing explicitly atomic
operations on those writes rather than surrounding them with locks.
ConcurrentHashMap, ConcurrentLinkedQueue, CompletableFuture,
SubmissionPublisher, and ForkJoin mostly do this.  I just locally
(jsr166 CVS) committed some FJ updates that do this in more cases and
add a few other tweaky adjustments.  The impact on ParallelGC is
small, but others now seem better behaved.
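
For instance (a minimal sketch, not code from any of the classes
above), a linked node whose reference field is already written with
volatile/release semantics can be updated with a VarHandle CAS rather
than under a lock:

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    // Minimal sketch: CASing a node's "next" field directly, in the
    // spirit of ConcurrentLinkedQueue-style updates, instead of
    // surrounding the write with a lock.
    final class Node<T> {
        final T item;
        volatile Node<T> next;

        private static final VarHandle NEXT;
        static {
            try {
                NEXT = MethodHandles.lookup()
                        .findVarHandle(Node.class, "next", Node.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        Node(T item) { this.item = item; }

        // Returns false if another thread already installed a successor.
        boolean casNext(Node<T> expect, Node<T> succ) {
            return NEXT.compareAndSet(this, expect, succ);
        }
    }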

Across various tests that mainly exercise concurrency (without doing
much else), on HotSpot, -XX:+UseParallelGC -XX:+UseCondCardMark
nearly always gives the best throughput, even better than no-GC
(Epsilon), but with occasional long GC pauses.  In most cases, G1 is
around 10% slower, sometimes more so.  I don't have enough empirical
experience with newer in-progress collectors or other JVMs to say
anything about them yet in general.  As always, the main determinants
of performance in ForkJoin (as well as ThreadPoolExecutor and other
async frameworks) are task issue rate (granularity) and isolation
(including not blocking within tasks).  If async tasks contain around
10,000 thread-local instructions, GC choices depend less on
concurrency per se and more on throughput/latency choices, etc.
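
For example (a sketch; the threshold below is an arbitrary
placeholder, not a recommendation), a ForkJoin computation can keep
task granularity coarse enough that each leaf does thousands of
thread-local operations:

    import java.util.concurrent.RecursiveTask;

    // Sketch of controlling task issue rate via a sequential cutoff:
    // leaves sum a few thousand elements each, so framework and GC
    // overheads are amortized over substantial thread-local work.
    final class SumTask extends RecursiveTask<Long> {
        static final int THRESHOLD = 8_192;     // placeholder cutoff
        final long[] a;
        final int lo, hi;

        SumTask(long[] a, int lo, int hi) {
            this.a = a; this.lo = lo; this.hi = hi;
        }

        protected Long compute() {
            if (hi - lo <= THRESHOLD) {         // small enough: run inline
                long s = 0L;
                for (int i = lo; i < hi; ++i) s += a[i];
                return s;
            }
            int mid = (lo + hi) >>> 1;
            SumTask left = new SumTask(a, lo, mid);
            left.fork();                        // run left half asynchronously
            long r = new SumTask(a, mid, hi).compute();
            return r + left.join();
        }
    }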

Please try out the tentative updates in
http://gee.cs.oswego.edu/dl/jsr166/dist/jsr166.jar (running with java
--patch-module java.base="$DIR/jsr166.jar").  We are also always
looking to add more performance tests that are informative about
basic functionality or interactions with other frameworks.
Contributions welcome.  If you are interested in further exploring
the impact on HotSpot, try out Aleksey's builds with some of the new
in-progress GCs, at https://builds.shipilev.net/
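
For instance (the benchmark class and directory layout here are just
placeholders), a run on HotSpot combining the updates with the flags
discussed above might look like:

    java -XX:+UseParallelGC -XX:+UseCondCardMark \
         --patch-module java.base="$DIR/jsr166.jar" \
         -cp "$DIR/classes" SomeBenchmark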

Aside: All of the above emphasizes that we are in the era of the
"killer microsecond", where most performance challenges are not
handled well by nanosecond-level (usually CPU-based, like instruction
parallelism) or millisecond-level (usually OS-based, like IO)
techniques.  See
https://cacm.acm.org/magazines/2017/4/215032-attack-of-the-killer-microseconds/fulltext

-Doug

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest

Re: juc and GC (plus FJ updates)

JSR166 Concurrency mailing list
I agree that some level of conditional card marking (like -XX:+UseCondCardMark in ParallelGC) can be critical to avoid sharing contention on card marks. With HotSpot's current byte-per-card and card-per-512-bytes-of-heap ratios, there is real sharing of individual card marks [each representing 512 bytes of heap], false sharing on a card mark line [representing 64 cards on x86], and false sharing for heap reference stores [all stores to the same 32KB-aligned region of heap (on x86) represent a store to the same card-mark cache line]. In a world where the likelihood of multiple threads mutating references stored in the same 32KB of heap is pretty high, conditional card marking should arguably be a default.
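
Restating that geometry as arithmetic (these constants just encode the x86/HotSpot ratios described above):

    // One card byte covers 512 bytes of heap; a 64-byte cache line holds
    // 64 card bytes; so one card-mark cache line covers 64 * 512 = 32768
    // bytes (32KB) of heap.
    final class CardGeometry {
        static final int HEAP_BYTES_PER_CARD      = 512;
        static final int CARD_BYTES_PER_CACHELINE = 64;
        static final int HEAP_BYTES_PER_CARDLINE  =
            HEAP_BYTES_PER_CARD * CARD_BYTES_PER_CACHELINE;   // 32768

        // Two reference stores dirty card bytes on the same cache line
        // iff their target addresses fall in the same 32KB-aligned
        // region of heap.
        static boolean sameCardLine(long addr1, long addr2) {
            return (addr1 >>> 15) == (addr2 >>> 15);          // 2^15 = 32KB
        }
    }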

Conditional card marking (in the sense that the mark does not dirty an already dirty card) is [currently] more expensive primarily because it involves reading the card value before deciding whether to write. When this extra cost is paid on every reference mutation, it can offset much of the scaling gains, and it presents an annoying tradeoff between single-threaded performance and multi-threaded scalability.
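
In rough pseudocode (a conceptual sketch of the two barrier flavors, not actual HotSpot code; the card-table representation is a placeholder):

    final class CardMarkSketch {
        static final byte DIRTY = 0;   // HotSpot's convention: 0 means dirty

        // Unconditional marking: a blind byte store on every reference
        // write.  Cheap for the writing thread, but keeps re-dirtying
        // (and invalidating the cache line of) already-dirty cards.
        static void unconditionalMark(byte[] cardTable, long storeAddr) {
            cardTable[(int) (storeAddr >>> 9)] = DIRTY;      // 512 bytes/card
        }

        // Conditional marking (-XX:+UseCondCardMark): read first, store
        // only if the card is not already dirty.  This turns most
        // write-write card contention into read sharing, at the cost of
        // an extra load on every reference store.
        static void conditionalMark(byte[] cardTable, long storeAddr) {
            int i = (int) (storeAddr >>> 9);
            if (cardTable[i] != DIRTY)
                cardTable[i] = DIRTY;
        }
    }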

But the extra cost of conditional card marking can be combated fairly effectively. For reference, in Zing we've worked hard to eliminate these tradeoffs (our designs started off on 384-way Vega SMPs, so false sharing on card marks was an obvious non-starter). Zing's card marking scheme uses conditional marking at all times, which avoids line contention at one level (writers to the same card or card cache line). We also use precise card marking [a card bit per heap word rather than a card byte per 512 heap bytes], which completely eliminates all card-related false sharing. In addition, we reduce the cost of the read-and-maybe-write: our conditional card modification check (which requires a card read) is much cheaper in practice because it is fronted by a conditional newgen-ref-being-stored-into-oldgen-location check, which is purely address-based, requires no reads, and eliminates 95%-99%+ of card modification attempts.
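
Roughly (again a conceptual sketch rather than Zing's actual barrier; the generation test is a stand-in range check):

    // Fronting the conditional card mark with a purely address-based
    // "young-gen reference stored into old-gen location" filter: only
    // such stores can require a remembered-set entry, so all other
    // stores skip the card table entirely -- no reads, no writes.
    final class FilteredBarrierSketch {
        static long youngGenStart, youngGenEnd;   // placeholder bounds

        static boolean inYoungGen(long addr) {
            return addr >= youngGenStart && addr < youngGenEnd;
        }

        static void refStoreBarrier(byte[] cardTable,
                                    long storeAddr, long newValueAddr) {
            if (inYoungGen(newValueAddr) && !inYoungGen(storeAddr)) {
                int i = (int) (storeAddr >>> 9);
                if (cardTable[i] != 0)          // conditional mark (0 = dirty)
                    cardTable[i] = 0;
            }
        }
    }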

Historically, the main design argument against bit-based card marking had been that unconditional card marks are faster when using a byte per card (a fire-and-forget byte store, with no need for a read-modify-write). However, once you buy into the conditional card marking notion (with or without a generation-based filter ahead of it), you've introduced a read-check-and-maybe-write anyway, so a read-check-and-maybe-modify-and-write is not that big a deal. The steps from there to precise (1 card per heap word) card marking then become pretty obvious.
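
A precise, bit-per-heap-word table then looks conceptually like the following (not Zing's code; note that a real collector must also keep concurrent updates of bits in the same word safe, which is glossed over here):

    // Precise card marking: one card bit per 8-byte heap word instead of
    // one card byte per 512 heap bytes, so distinct written words never
    // share a card.  Since the mark is already read-check-and-maybe-write,
    // the extra "modify" step of ORing in a bit adds little.
    final class PreciseCardSketch {
        static void markPrecise(long[] cardBits, long storeAddr) {
            long word = storeAddr >>> 3;        // which heap word was stored to
            int  idx  = (int) (word >>> 6);     // 64 card bits per long
            long mask = 1L << (word & 63);
            long cur  = cardBits[idx];
            if ((cur & mask) == 0)              // read-check...
                cardBits[idx] = cur | mask;     // ...maybe-modify-and-write
        }
    }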

I'd encourage other card-marking collectors (which is pretty much all current and most future generational schemes) to look at similar notions for the card table, as they can help reduce the multi-threaded scaling sensitivity of most workloads. Specifically, reducing the cost of conditional card marking (e.g. by fronting it with a cross-generation-conditional filter) can help eliminate the notion that a tradeoff is involved in turning on -XX:+UseCondCardMark. With the current 32KB-wide reference-mutation-contention window in HotSpot, false-card-line-sharing contention can easily hit unsuspecting and even unrelated data structures, and the obvious healthy default should avoid that.

— Gil.
