I posted a draft JEP about adding spinLoopHint() for discussion on core-libs-dev and hotspot-dev. May be of interest to this group. The main focus is supporting outside-of-the-JDK spinning needs (for which there are multiple eager users), but it could/may be useful under the hood in j.u.c.
Discussion thread: http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-October/035613.html

See the draft JEP, tests, and links to prototype JDKs to play with here: https://github.com/giltene/GilExamples/tree/master/SpinHintTest

— Gil.

_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
If you haven't seen it, you may also be interested in [link omitted], which seems to be a very different perspective on roughly the same space.

On Tue, Oct 6, 2015 at 8:11 AM, Gil Tene <[hidden email]> wrote:

> I posted a draft JEP about adding spinLoopHint() for discussion on core-libs-dev and hotspot-dev. [...]
A variant of synchronic for j.u.c would certainly be cool to have. Especially if it supports a hint that makes it actually spin forever rather than block (this may be what expect_urgent means, or maybe a dedicated spin level is needed). An implementation could use spinLoopHint() under the hood, or other things where appropriate (e.g. if MWAIT was usefully available in user mode in some future, and had a way to limit the wait time).
However, an abstraction like synchronic is a bit higher level than spinLoopHint(). One of the main drivers for spinLoopHint() is direct use by programs and libraries outside of the core JDK. E.g. spinning indefinitely (or for limited periods) on dedicated vcores is a common practice in high performance messaging and communications stacks, and is not unreasonable on today's many-core systems. Seeing 4-8 threads "pinned" with spinning loops is commonplace in trading applications, in kernel-bypass network stacks, and in low latency messaging. And the conditions for spins are often more complicated than those expressible by synchronic (e.g. watching multiple addresses in a mux'ed spin). I'm sure a higher level abstraction for a spin wait can be enriched enough to come close, but there are many current use cases that aren't covered by any currently proposed abstraction. So, I like the idea of an abstraction that would allow uncomplicated spin-wait use, but I also think that direct access to spinLoopHint() is very much needed. They don't contradict each other.

— Gil.
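[Editor's note: the "mux'ed spin" Gil describes, one thread watching several locations at once, can be sketched roughly as below. The ring counters are hypothetical, purely for illustration; Thread.onSpinWait() is the name under which this hint eventually shipped in Java 9 (JEP 285).]

```java
import java.util.concurrent.atomic.AtomicLong;

public class MuxedSpin {
    // Hypothetical sequence counters, e.g. one per inbound message ring.
    final AtomicLong ringA = new AtomicLong();
    final AtomicLong ringB = new AtomicLong();
    long seenA, seenB;

    // Spin until either ring advances; return which one did.
    // A synchronic-style "wait on one address" cannot express this directly.
    int awaitAny() {
        for (;;) {
            if (ringA.get() != seenA) { seenA = ringA.get(); return 0; }
            if (ringB.get() != seenB) { seenB = ringB.get(); return 1; }
            Thread.onSpinWait(); // the spin-loop hint
        }
    }
}
```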
I am not fully up to speed on this topic. However, why not call
Thread.yield()? If there are no other threads waiting to get on the
processor, then Thread.yield() does nothing. The current thread
keeps executing. If there are threads waiting to get on the
processor, then the current thread goes to the end of the run queue and
another thread gets on the processor (i.e. a context switch). The
thread will run again after the other threads ahead of it either
block, call yield() or use up their time slice. The only time
Thread.yield() will do anything is if *all* of the processors are
busy (i.e. 100% CPU utilization for the machine). You could run
1000s of threads in tight Thread.yield() loops and all of the
threads will take a turn to go around the loop one time and then go
to the end of the run queue.
I've tested this on Windows and Linux (Intel 64-bit processors).

Some people are very afraid of context switches. They think that context switches are expensive. This was true of very old Linux kernels. Nowadays, a context switch costs hundreds of nanoseconds. Of course, the cache may need to be reloaded with the data relevant to the running thread.

-Nathan

On 10/6/2015 11:56 AM, Gil Tene wrote:
> A variant of synchronic for j.u.c would certainly be cool to have. [...]
When comparing spinLoopHint() to Thread.yield(), we're talking about different orders of magnitude, and different motivations.
On the motivation side: a major reason for using spinLoopHint() is to improve the reaction time of a spinning thread (from the time the event it is spinning for actually occurs until it actually reacts to it). Power savings is another benefit. Thread.yield() doesn't help with either.

On the orders-of-magnitude side: Thread.yield() involves making a system call. This makes reaction literally 10x+ slower than spinning without it, and certainly pulls in the opposite direction of spinLoopHint().
My question about spinLoopHint() would be whether it can be defined in a way that makes it useful across architectures. I vaguely remember seeing claims that even the x86 instructions are not implemented consistently enough to be easily usable in portable code. I have no idea (though I probably should) about ARM equivalents or the like.

It also seems to me that unbounded spin loops are almost always a bad idea. (If you've been spinning for 10 seconds, you should be sleeping instead. You might even be inadvertently scheduled against the thread you're waiting for. Since you're waiting anyway, you might as well keep track of how long you've been spinning.) But the idea here would be that this is the low-level primitive you use if you haven't been spinning for very long? The alternative is to pass in some indication of how long you've been spinning, and have this yield, or sleep, after a sufficiently long time.

Hans

On Tue, Oct 6, 2015 at 6:41 PM, Gil Tene <[hidden email]> wrote:
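[Editor's note: Hans's alternative, passing in how long you have been spinning and escalating from spinning to yielding to sleeping, might look like the sketch below. The thresholds are arbitrary, for illustration only, and Thread.onSpinWait() stands in for the proposed spinLoopHint().]

```java
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

public final class Backoff {
    static final int SPIN_LIMIT  = 1_000;  // arbitrary illustrative thresholds
    static final int YIELD_LIMIT = 2_000;

    /** Back off appropriately for the given number of failed attempts so far. */
    static void backoff(int attempts) {
        if (attempts < SPIN_LIMIT) {
            Thread.onSpinWait();          // cheap hint while the spin is young
        } else if (attempts < YIELD_LIMIT) {
            Thread.yield();               // give up the time slice
        } else {
            LockSupport.parkNanos(1_000); // sleep briefly; we've waited a while
        }
    }

    /** Spin on a condition, escalating the back-off as attempts accumulate. */
    static void awaitTrue(BooleanSupplier cond) {
        int attempts = 0;
        while (!cond.getAsBoolean()) backoff(attempts++);
    }
}
```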
On 10/08/2015 06:50 PM, Hans Boehm wrote:
> My question about spinLoopHint() would be whether it can be defined
> in a way that makes it useful across architectures. I vaguely
> remember seeing claims that even the x86 instructions are not
> implemented consistently enough to be easily usable in portable
> code. I have no idea (though I probably should) about ARM
> equivalents or the like.

There are ARM equivalents defined in the architecture, but I don't know if they're much more than NOPs.

> It also seems to me that unbounded spin loops are almost always a
> bad idea. (If you've been spinning for 10 seconds, you should be
> sleeping instead. You might even be inadvertently scheduled against
> the thread you're waiting for. Since you're waiting anyway, you
> might as well keep track of how long you've been spinning.) But the
> idea here would be that this is the low-level primitive you use if
> you haven't been spinning for very long?

Right. I don't speak for Gil, but I don't think anyone is proposing to do any more than adding this hint to the spin loops that people use already.

Andrew.
In reply to this post by Gil Tene
Variable X transitions from value A to value B over time t.
What is the expected reaction time of a spinning thread? The answer is: it really depends on your cost model. If you are waiting for X to become B, you may be waiting for up to t units of time. What difference would it make in your cost model if instead it waited for N% of t more? When N% of t becomes larger than the time to switch context, you yield. But this is a selfish model (my wait is more important than letting the others use the CPU).

Alex

On 07/10/2015 02:41, Gil Tene wrote:
> When comparing spinLoopHint() to Thread.yield(), we're talking about different orders of magnitude, and different motivations. [...]
In reply to this post by Nathan Reynolds-2
On 06/10/15 21:15, Nathan Reynolds wrote:
> Some people are very afraid of context switches. They think that
> context switches are expensive. This was true of very old Linux
> kernels. Nowadays, it costs hundreds of nanoseconds to do a
> context switch.

In practice people don't use threads as much as they could because of the cost of such switches. Say you've constructed a block of data and you want to encrypt it before saving it somewhere. What most people do today is call encrypt() synchronously. But chances are you have cores on the same machine which are stopped, so you could hand that task to another core. But to do that you have to signal the stopped core, and the latency between a FUTEX_WAKE and a stopped thread starting is at least a couple of microseconds. You can encrypt at about 1 ns/byte, so that's a couple of kilobytes of encryption just to wake the thread. And of course there's the cache overhead too. In practice, all this latency means that it's not worth waking another core unless your block of data is pretty large.

So how do you solve this problem? You spin. And then the time to start a waiting thread is not a couple of microseconds but tens of nanoseconds, the time it takes to encrypt tens of bytes.

Andrew.
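[Editor's note: the hand-off Andrew describes, a spinning worker that picks up blocks without a futex wake, can be sketched as below. The XOR loop is a stand-in for real encryption, and Thread.onSpinWait() (Java 9, JEP 285) plays the role of the proposed spinLoopHint().]

```java
import java.util.concurrent.atomic.AtomicReference;

public class SpinHandoff {
    static final AtomicReference<byte[]> slot = new AtomicReference<>();
    static volatile boolean done;
    static volatile int processed;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (!done) {
                byte[] block = slot.getAndSet(null);
                if (block == null) {
                    Thread.onSpinWait(); // spin instead of parking: no wake-up latency
                    continue;
                }
                for (int i = 0; i < block.length; i++) block[i] ^= 0x5A; // "encrypt"
                processed++;
            }
        });
        worker.start();
        for (int i = 0; i < 100; i++) {
            // Hand a block to the worker; spin while the slot is still occupied.
            while (!slot.compareAndSet(null, new byte[64])) Thread.onSpinWait();
        }
        while (slot.get() != null) Thread.onSpinWait(); // wait for the last block to be taken
        done = true;
        worker.join();
        System.out.println(processed); // prints 100
    }
}
```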
In reply to this post by Gil Tene
How exactly does this work?
My understanding (very, very limited) was that MWAIT works with a memory address, pseudo: "continue execution upon a write to memory location X", but the proposed spinLoopHint() doesn't take any argument. Is the idea that the JIT would somehow figure out the memory address in question? E.g., I looked at your SpinHintTests; how would the runtime "know" that #spinData was the memory address to monitor?
On 11/10/15 17:42, thurstonn wrote:
> How exactly does this work?
> My understanding (very, very limited), was that MWAIT works with a memory
> address, pseudo:
> "continue execution upon a write to memory location X",
> but the proposed spinLoopHint() doesn't take any argument.

spinLoopHint() is just a PAUSE instruction. It's not an MWAIT.

Andrew.
In reply to this post by oleksandr otenko
On 10/10/2015 16:54, Gil Tene wrote:
ok, I'll rephrase my statement: "Variable X transitions from value A to value B over/during time dt." :-)

It doesn't matter what the absolute value of t is. But if you observe that the value is not B, you are going to wait up to dt units of time, owing to the nature of the transition from A to B. Then in this world there is also some overhead from observing that it is now B.

Saying "make that overhead as small as possible" is not accurate. Saying "make that overhead less than 100 nanoseconds" is too strict: why would you care whether it is 100 nanoseconds if dt is 10 milliseconds? Granted, there will be cases where you'd justify the "100 nanosecond" overhead even if dt is "10 ms", hence my remark that it really depends on what the cost function is, but the main consumer of concurrency primitives will want to relax the overhead to be some function of dt, since the average wait time is already a function of dt.
If we continue this analogy long enough, they do leave the checkout (I doubt it is for a smoke - more like to stack shelves), and the peers press the button to summon them back when getting congested. You may be able to optimize the levels of adrenaline in the customer's bloodstream, if they see the cashier race to the checkout (instead of leisurely walk).
A less important point: let's define reaction time. At the moment I am looking at it like so: we can't measure the time between events in two different threads, so we have a third timeline observing events in both threads. But there is no "reaction time" on it:

    waiter   transitioner   observer
      |           |            |
      X           A            |
      |           |           XA+
      |           B            |
      |           |           B+
      B-          |            |
      Y           |            |
      |           |           Y+
      |           |            |

Suppose for simplicity that X "started the wait to make progress" and A "started transition" are observed simultaneously at the point XA+. Suppose B "finished transition" is observed at B+. Suppose Y "responded to transition to B" is observed at point Y+. Suppose Y+ also tells the observer the time dy between B- "noticed B" and Y.

Here "concurrency overhead" is perceived as (Y+)-dy-(B+). It is independent of XA+ and Y+, but whether it makes sense to reduce it really depends on the magnitude of Y-X, an estimate of which is (Y+)-(XA+), and on the cost function or SLA. You might say that "concurrency overhead" is the "reaction time", but it really is two or more "reaction times", even if you make the thread transitioning A to B the observer thread, instead of having a separate observer thread.

A more important point: reducing the overheads makes sense when they constitute an important part of the overall time. Maybe you are promoting the "the wait time is so expensive" case. Adding support for that is a good thing. But most cases would want some back-off according to cumulative wait time.

Alex
In reply to this post by Andrew Haley
Andrew Haley wrote:
> On 11/10/15 17:42, thurstonn wrote:
>
>> How exactly does this work?
>> My understanding (very, very limited), was that MWAIT works with
>> a memory address, pseudo:
>> "continue execution upon a write to memory location X",
>> but the proposed spinLoopHint() doesn't take any argument.
>
> spinLoopHint() is just a PAUSE instruction. It's not an MWAIT.

Somewhere along the way, Doug had mentioned MWAIT as a different but related concept: PAUSE is to yield() as MWAIT is to park(). (And yes, the specific proposal for spinLoopHint() is to use PAUSE.)

Cheers,
Justin
On 12/10/15 21:38, Justin Sampson wrote:
> Andrew Haley wrote:
>
>> On 11/10/15 17:42, thurstonn wrote:
>>
>>> How exactly does this work?
>>> My understanding (very, very limited), was that MWAIT works with
>>> a memory address, pseudo:
>>> "continue execution upon a write to memory location X",
>>> but the proposed spinLoopHint() doesn't take any argument.
>>
>> spinLoopHint() is just a PAUSE instruction. It's not an MWAIT.
>
> Somewhere along the way, Doug had mentioned MWAIT as a different but
> related concept: PAUSE is to yield() as MWAIT is to park().

That was me, really: I'm looking for a nice way to handle WFE on AArch64 and mentioned it on the HotSpot list. Hans's reference to Synchronic objects is interesting but I can't quite see how to make it fit Java. I'm wondering if a flyweight version of park() with a timeout might do the job, but it's not perfect because you can't communicate any information through a synchronization value. Still, it would be faster than what we have at the moment.

Andrew.
In reply to this post by Hans Boehm
It seems to me that the trick here is to be explicit as to what is intended. Presumably this is intended to discourage speculative execution across a spinLoopHint(). It is not intended to, for example, put the processor into some sort of sleep state for a while, though that might also make sense under slightly different circumstances. I would emphasize that this is expected not to increase latency. It might happen to reduce power consumption, but a power-reducing, latency-increasing implementation is not expected. On Sat, Oct 10, 2015 at 8:41 AM, Gil Tene <[hidden email]> wrote: