Enforcing total sync order on modern hardware


Enforcing total sync order on modern hardware

Marko Topolnik-2
In an old post I drew the diagram below, which showed happens-before edges between three threads. The thread shown at the bottom was a "clock-updating" thread, continuously updating a volatile "current time" value. The other two threads read the time and were additionally coupled through a shared volatile variable with one writer and one reader thread.

My point was that the threads could behave in a paradoxical way from the standpoint of "global time": the reader could observe a late time value and nevertheless not be prevented from observing an early write to the shared var.

The JMM actually prevents this behavior with the enforcement of a total sync order.

However, during a private discussion with Martin Thompson, it seemed unclear how exactly a runtime would actually enforce total sync order without hurting performance. Given that, since Nehalem, cores communicate point-to-point over QPI and don't lock the global front-side bus, the CPU doesn't naturally offer a total ordering of all lock operations. I would be very interested to learn how exactly this goes on, or perhaps whether Martin and I were missing something and this is actually not an issue.

Here is the diagram reposted:


                                /--> Rrt6 --/-> Rrt9 --> Rv0
    ---------------------------+------------+--------/
  /                            |     ------/
Wv0 ---> Rwt3 -> Wv1 --> Rwt6  |    /
        /                /   --|   /
       |                | /       /
       T3 ------------> T6 ----> T9


T3, T6, T9 -- writes to currentTime
Rwt3, Rwt6 -- reads of currentTime by writing thread
Rrt6, Rrt9 -- reads of currentTime by reading thread
Wv0, Wv1   -- writes to the sharedVar
Rv0        -- read of the sharedVar

initially t = 0 ms;
T3 writes t = 3 ms;
T6 writes t = 6 ms;
T9 writes t = 9 ms.

The program outputs

Writing 0 at t = 0 ms
Writing 1 at t = 3 ms
Reading 0 at t = 9 ms
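
Roughly, in Java the setup looks like this (a simplified sketch; the class
and local names are illustrative, not the original source):

    class TotalSyncOrderExample {
        static volatile long currentTime = 0;  // written only by the clock thread
        static volatile int sharedVar = 0;     // written only by the writer thread

        public static void main(String[] args) {
            Thread clock = new Thread(() -> {
                for (long t = 3; t <= 9; t += 3) {
                    currentTime = t;            // T3, T6, T9
                }
            });
            Thread writer = new Thread(() -> {
                System.out.println("Writing 0 at t = " + currentTime + " ms");
                sharedVar = 0;                  // Wv0
                long t1 = currentTime;          // Rwt3 (observes 3 in this execution)
                System.out.println("Writing 1 at t = " + t1 + " ms");
                sharedVar = 1;                  // Wv1
                long t2 = currentTime;          // Rwt6 (observes 6)
            });
            Thread reader = new Thread(() -> {
                long t1 = currentTime;          // Rrt6 (observes 6)
                long t2 = currentTime;          // Rrt9 (observes 9)
                System.out.println("Reading " + sharedVar + " at t = " + t2 + " ms"); // Rv0
            });
            clock.start(); writer.start(); reader.start();
        }
    }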


---
Marko Topolnik
Hazelcast



Re: Enforcing total sync order on modern hardware

Vitaly Davidovich
Marko,

Can't scheduling alone cause a situation where reader reads t=9ms, looks at the shared value, and sees something old that hasn't been updated yet since the writer hasn't observed the t=9ms yet (due to scheduling)? Nehalem is a TSO (total store order) architecture, which means each core's writes appear in the same order to all other cores, but there's no global order across all cores.  So your clock-updating thread's stores will appear to the reader/writer in the same order, but what can reader/writer say about the other if each reads t = X? I think nothing of consequence given they're not synchronizing-with the clock-updating thread, but simply observing that value in a "racy" (with respect to each other) manner.

I'm probably misunderstanding your question/point though, so please correct me.


Re: Enforcing total sync order on modern hardware

Andrew Haley
In reply to this post by Marko Topolnik-2
On 03/16/2015 05:00 PM, Marko Topolnik wrote:

> Given that, since Nehalem, cores communicate point-to-point over QPI
> and don't lock the global front-side bus, the CPU doesn't naturally
> offer a total ordering of all lock operations.

Intel do actually guarantee [1]

    Locked instructions have a total order.

so this is a hardware problem, not a software one.  How exactly the
hardware people do this on a large network of processors is some of
the most Secret Sauce, but I can imagine some kind of combining
network in hardware.

Andrew.

[1]  Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3 (3A, 3B & 3C): System Programming Guide 8.2.2, Memory
Ordering in P6 and More Recent Processor Families
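
As a Java-level aside, the operations below are the kind that HotSpot
typically compiles to LOCK-prefixed instructions on x86 (this describes
typical code generation, not a guarantee of any specification; the class
name is made up):

    import java.util.concurrent.atomic.AtomicInteger;

    class LockedOps {
        static final AtomicInteger counter = new AtomicInteger();

        public static void main(String[] args) {
            counter.incrementAndGet();     // usually LOCK XADD on x86
            counter.compareAndSet(1, 2);   // usually LOCK CMPXCHG on x86
        }
    }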

Re: Enforcing total sync order on modern hardware

Marko Topolnik-2
In reply to this post by Vitaly Davidovich
Vitaly,

if you refer back to the diagram, the writer thread:

1. wrote 0 to the shared var;
2. observed t = 3ms;
3. wrote 1 to the shared var;
4. observed t = 6 ms. 

The reader thread:
1. observed t = 6 ms;
2. observed t = 9 ms;
3. read 0 from the shared var.

This should eliminate the effects of scheduling.

The point is, the above violates the Java Memory Model because there is no total synchronization order under which this result could be obtained. On the other hand, it is happens-before consistent.
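
Spelled out, the contradiction is roughly this: any total synchronization
order would have to contain

    Wv1  -> Rwt6    (program order of the writer)
    Rwt6 -> T9      (Rwt6 returned 6, so it must precede the write of 9)
    T9   -> Rrt9    (Rrt9 returned 9, so it must follow the write of 9)
    Rrt9 -> Rv0     (program order of the reader)

which by transitivity places Wv1 before Rv0, so Rv0 would have to return 1,
not 0.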


Re: Enforcing total sync order on modern hardware

Marko Topolnik-2
In reply to this post by Andrew Haley
Andrew,

thank you for the reference, this answers the dilemma in full. I didn't know this guarantee existed on x86.

---
Marko


Re: Enforcing total sync order on modern hardware

Vitaly Davidovich

Why were you concerned with lock instructions specifically? At one point in the past, volatile writes were done using mfence, IIRC.

sent from my phone


Re: Enforcing total sync order on modern hardware

Marko Topolnik-2
What is important is that there be _some_ way of guaranteeing total sync order at the CPU level. It is less important whether this is achieved by mfence or lock instruction.

-Marko


Re: Enforcing total sync order on modern hardware

Vitaly Davidovich
By "total sync order at the CPU level" do you mean sync order of that cpu itself or some global order across all CPUs? The total order of lock instructions is across all CPUs, whereas mfence, AFAIK, orders only the local CPU's memory operations.  Sorry, maybe I'm being dense today, but I still don't get why knowing that lock instructions have total order somehow answers your question.  Were you simply asking whether there are instructions available to ensure memory operations are done (or appear to be) in program (minus compiler code motion) order on a per-cpu basis? 


Re: Enforcing total sync order on modern hardware

Marko Topolnik-2
I was wondering how a JVM implementation would be able to enforce global sync order across all CPUs in a performant way. So, given the total ordering on lock instructions, I would assume that the implementation of any given synchronizing action would have to involve a lock instruction at some point.


Re: Enforcing total sync order on modern hardware

Marko Topolnik-2
There is another concern that may be interesting to reconsider. Given the lack of total sync order when just using memory barriers, is the JSR-133 Cookbook wrong/outdated in this respect? It doesn't at all deal with the issue of the sync order, just with the visibility of inter-thread actions.

If a JVM just followed the rules put forth by the Cookbook, would an invalid execution as outlined in my diagram actually be possible?


Re: Enforcing total sync order on modern hardware

Aleksey Shipilev-2
On 17.03.2015 9:31, Marko Topolnik wrote:
> There is another concern that may be interesting to reconsider. Given
> the lack of total sync order when just using memory barriers, is the
> JSR-133 Cookbook wrong/outdated in this respect? It doesn't at all deal
> with the issue of the sync order, just with the visibility of
> inter-thread actions.

Sigh. The rule of thumb in dealing with concurrency: assume you are
misunderstanding something before you think something agreed upon is
incorrect. In fact, in this thread, you confused yourselves with "memory
barriers are lacking total order". Memory barriers *are* regaining the
total ordering when used correctly. E.g. on x86, this is a sequentially
consistent code:

  SeqCst store:
    mov %blah, (memory)   ; volatile store
    lock addl $0, (%esp)

  SeqCst load:
    mov (memory), %blah   ; volatile load

The alternative also works:

  SeqCst store:
    mov %blah, (memory)   ; volatile store
    mfence

  SeqCst load:
    mov (memory), %blah   ; volatile load


The mental model I am having in my head is as follows:

  a) Cache-coherent systems maintain the consistent (coherent) view of
each memory location at any given moment. In fact, most coherency
protocols provide the total order for the operations on a *single*
location. Regardless how the actual interconnect is operating, the cache
coherency protocols are to maintain that illusion. MESI-like protocols
are by nature message-based, and so they do not require shared bus to
begin with, so no problems with QPI.

  b) CPUs can do dirty work, either by doing multiple accesses at once or
by not consulting the coherency at all. That has the potential to break the
invariants the coherency is trying to maintain: either through
overlapping the accesses, or through not exposing the values right away. Memory
barriers are there to instruct the hardware to stop being smart and expose
the operations to coherency in instruction order.

  c) The easiest way to provide the memory fence effect is to force the CPU to
wait for the completion of the fenced operations, without doing anything else.
What fences do is (almost) exactly that: they let coherency
sort out the final value of the memory location without overlapping the
access with anything else.

E.g. on x86, where total store order is provided by hardware, the
fencing is required to avoid store forwarding effects. With store
forwarding and no fencing, the writer CPU can "see" the written value
before anything else sees it, and can go on pretending everyone else saw
that value as well. This shortcut sets us up for potentially observing
the non-total ordering:

Dekker idiom:

   volatile int x, y;
 -----------------------
  x = 1        y = 1
  r1 = y       r2 = x

(r1, r2) = (0, 0) is forbidden, non-SC.

Imagine CPU1 does the "x = 1" store, and it is on its way to coherency,
now in store buffer. CPU1 then proceeds to read "y", and imagine it
reads "0", because CPU2 is lagging behind. Now, CPU2 does "y = 1" store.
Then CPU2 does "r2 = x" load, and reads "0" (!), because the "x" value
is still stuck in CPU1's store buffer. This is a "transient" condition
when coherency has no idea CPUs have different ideas about the value of
"x", because all values have not yet reached coherency. Locked
instruction / mfence after "x = 1" write basically says "until we are
sure "x = 1" had completed, freeze and don't make any moves."

Even without store buffers, when CPUs can start writing "x = 1" and "y =
1" at the same time, waiting for their local operations to complete
saves us from failures due to overlapped in-flight updates of "x" and
"y". In other words, CPU1 would not be able to start loading "y" before
it completed "x = 1"; so if CPU1 reads "y" after completing the write of
"x" while CPU2's write of "y = 1" is still pending, CPU2 ought to
observe "x = 1". Memory barriers provide the "lock-step" behavior here.

The same line of reasoning applies to non-TSO hardware,
non-multiple-copy-atomic hardware, etc. Memory barriers are routinely
used to regain the order of operations. These things are
well-researched, so instead of relying on mailing list hearsay, see the
authoritative sources here:

  http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf
  http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
  http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
  http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html


> If a JVM just followed the rules put forth by the Cookbook, would an
> invalid execution as outlined in my diagram actually be possible?

This is the monospace-friendly version:
 http://cs.oswego.edu/pipermail/concurrency-interest/2015-March/014118.html

Thinking on the hardware level is full of surprises, and one should do
that only very carefully.

So, if only "currentTime" is volatile (sequentially consistent), you
can still read "0" from the non-volatile "sharedVar" in Rv0. The easiest way
to achieve that hardware-wise is to get Wv1 stuck in the store buffer,
voila. It does not depend on whether the loads and stores to
"currentTime" are totally ordered or not. Even if they are, Wv1 can
still move past Rwt6.

If "sharedVar" is also volatile (sequentially consistent), then Wv1
would complete before reading Rwt6. Reading Rwt6 after the write means
the write is observable near tick 6: it is plausible the clock ticked 6
before we were writing; it is plausible the clock ticked 6 right after
we did the write. Which *really* means the write is guaranteed to be
observable at the *next* tick, T9, since "currentTime" reads/writes are
totally ordered. Therefore, once the reader thread observed t=9, it
should also observe the Wv1, rendering Rv0 reading "0" incorrect.

                                Rrt9 ---> Rv0
  Wv0 --> Wv1 --> Rwt6           ^
         .---------^         .---/
       T6 ---------------> T9

 "global time" -------------------------------->


Notice how this relies on the writer thread to observe Rwt6! That's a
reference frame for you. If writer was to observe Rwt9, you might have
plausibly inferred the Wv1 may be not visible at Rv0:

            Rrt9 ---> Rv0
              ^
  Wv0 --------+---------> Wv1 --> Rwt9
         .----/--------------------^
       T9

 "global time" -------------------------------->

It inherently relies on T6 updates to wait for everyone to see the
update. In other words, guaranteeing every thread sees the consistent
progression of values in some "global time".

Thanks,
-Aleksey.




Re: Enforcing total sync order on modern hardware

Marko Topolnik-2
On Tue, Mar 17, 2015 at 11:46 AM, Aleksey Shipilev <[hidden email]> wrote:
> The mental model I am having in my head is as follows:
>
>   a) Cache-coherent systems maintain the consistent (coherent) view of
> each memory location at any given moment. In fact, most coherency
> protocols provide the total order for the operations on a *single*
> location. Regardless how the actual interconnect is operating, the cache
> coherency protocols are to maintain that illusion. MESI-like protocols
> are by nature message-based, and so they do not require shared bus to
> begin with, so no problems with QPI.

So let's fix the following total order on currentTime:

T3 -> Rwt3 -> T6 -> Rwt6 -> Rrt6 -> T9 -> Rrt9
 
If "sharedVar" is also volatile (sequentially consistent), then Wv1
would complete before reading Rwt6.

OK, but this wouldn't necessarily happen on a unique global timescale: the "writing" thread would have the ordering Wv1 -> Rwt6; there would be an _independent_ total order of actions on currentTime, and a third, again independent order of actions by the "reading" thread. Due to the distributed nature of coherence the fact that, on one core, Wv1 precedes Rwt6 does not enforce Rrt6 -> Rv1 on another core. It is not obvious that there is transitivity between these individual orders.

Particularly note this statement in http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf:

"[the CPU vendor specifications] admit the IRIW behaviour above but, under reasonable assumptions on the strongest x86 memory barrier, MFENCE, adding MFENCEs would not suffice to recover sequential consistency (instead, one would have to make liberal use of x86 LOCK’d instructions). Here the specifications seem to be much looser than the behaviour of implemented processors: to the best of our knowledge, and following some testing, IRIW is not observable in practice, even without MFENCEs. It appears that some JVM implementations depend on this fact, and would not be correct if one assumed only the IWP/AMD3.14/x86-CC architecture."

Also, for the newer revision of Intel's specification, “P6. In a multiprocessor system, stores to the same location have a total order” has been replaced by: “Any two stores are seen in a consistent order by processors other than those performing the stores.”

So here's a consistent order seen by all the processors except those running the two writing threads:

Wv0 -> T3 -> T6 -> T9 -> Wv1

This also respects the total ordering for each individual site, and a total ordering of each individual processor's stores. The "reading" thread inserts its Rv0 between T9 and Wv1.

 
> Reading Rwt6 after the write means
> the write is observable near tick 6: it is plausible the clock ticked 6
> before we were writing; it is plausible the clock ticked 6 right after
> we did the write. Which *really* means the write is guaranteed to be
> observable at the *next* tick, T9, since "currentTime" reads/writes are
> totally ordered. Therefore, once the reader thread observed t=9, it
> should also observe the Wv1, rendering Rv0 reading "0" incorrect.
>
>                                 Rrt9 ---> Rv0
>   Wv0 --> Wv1 --> Rwt6           ^
>          .---------^         .---/
>        T6 ---------------> T9
>
>  "global time" -------------------------------->


> Notice how this relies on the writer thread to observe Rwt6! That's a
> reference frame for you. If writer was to observe Rwt9, you might have
> plausibly inferred the Wv1 may be not visible at Rv0:

Thanks, that was precisely my motivation to add Rwt6 :)
 
---
Marko


Re: Enforcing total sync order on modern hardware

Vitaly Davidovich

The Intel spec update is to account for store buffers in the absence of fences.

In your "inserts a Rv0" example aren't we back to scheduling? There's no happens before relationship between the time value and the two other threads.  Making the time writes volatile translates into those writes being visible to all other cores before the time thread progresses - when that time value is observed is up to those other cores.  If the two threads communicated via piggybacking on the time value, say writer writes Svar=9 only when t=9 then reader is guaranteed to see at least Svar=9 if they also see t=9 (on x86), assuming Svar is volatile.  But as your example stands, you have some value being written globally every so often, and then two other threads peeking and poking at it while doing their own thing.

sent from my phone


Re: Enforcing total sync order on modern hardware

Aleksey Shipilev-2
In reply to this post by Marko Topolnik-2
On 03/17/2015 04:39 PM, Marko Topolnik wrote:

> On Tue, Mar 17, 2015 at 11:46 AM, Aleksey Shipilev
> <[hidden email] <mailto:[hidden email]>> wrote:
> So let's fix the following total order on currentTime:
>
> T3 -> Rwt3 -> T6 -> Rwt6 -> Rrt6 -> T9 -> Rrt9
>  
>
>     If "sharedVar" is also volatile (sequentially consistent), then Wv1
>     would complete before reading Rwt6.
>
>
> OK, but this wouldn't necessarily happen on a unique global timescale:
> the "writing" thread would have the ordering Wv1 -> Rwt6; there would be
> an _independent_ total order of actions on currentTime, and a third,
> again independent order of actions by the "reading" thread. Due to the
> distributed nature of coherence the fact that, on one core, Wv1 precedes
> Rwt6 does not enforce Rrt6 -> Rv1 on another core. It is not obvious
> that there is transitivity between these individual orders.
No one fully understands how exactly the hardware concurrency works. All
the academic papers you and I cited before are trying to recover the
actual semantics from experiments and anecdotal examples. Therefore,
lacking a clearly specified model, we can only come up with some model
and explain the behavior there.

I offered the simplistic mental model where fences expose the values to
global coherency in instruction order, and thus regain the total order.
When you say "wouldn't necessarily happen on a unique global timescale"
or "due to distributed nature of coherence", you seem to be implying
some other model, but that definition is too vague to be useful. Indeed,
I can come up with an arbitrarily weak model that allows pretty much
anything.


> Particularly note this statement
> in http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf
> <http://www.cl.cam.ac.uk/%7Epes20/weakmemory/cacm.pdf>:
>
> "[the CPU vendor specifications] admit the IRIW behaviour above but,
> under reasonable assumptions on the strongest x86 memory barrier,
> MFENCE, adding MFENCEs would not suffice to recover sequential
> consistency (instead, one would have to make liberal use of x86 LOCK’d
> instructions). Here the specifications seem to be much looser than the
> behaviour of implemented processors: to the best of our knowledge, and
> following some testing, IRIW is not observable in practice, even without
> MFENCEs. It appears that some JVM implementations depend on this fact,
> and would not be correct if one assumed only the IWP/AMD3.14/x86-CC
> architecture."

http://www0.cs.ucl.ac.uk/staff/j.alglave/papers/popl09.pdf has a
discussion about the suspected semantics of MFENCE, and they argue
MFENCEs are not enough if you are being paranoid about this. They also
argue Intel SDM can be read as if MFENCEs are actually serialized.

My example above also implied the fences are globally serialized. This
matches the actual practice in both C/C++11 and JVM mappings, which assume
x86 MFENCEs are globally serialized and that MFENCE/LOCK'd instructions
can be used interchangeably.

POWER's hwsync seems to provide SC guarantees in the similar fashion,
"Regaining sequential consistency (SC) using sync: If one adds a sync
between every program-order pair of instructions (creating tests
SB+syncs, MP+syncs, WRC+syncs, and IRIW+syncs), then all the non-SC
results above are forbidden,"
(http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/pldi105-sarkar.pdf).

These intricate details of hardware concurrency are one of the reasons we
have the JMM. The JMM explicitly requires a total order on synchronization
actions, and JVM implementors are responsible for figuring out how to map
it back onto hardware, be it with fences, locked instructions, or whatever else.

Aside: The IRIW example on POWER requires hwsyncs before the volatile
reads, which incurs performance penalties. The discussion on whether
IRIW (and more broadly, the total order that mandates IRIW) is something
we should care about in a language memory model is way, way above
most people's heads, including mine. Until there is an expert consensus
on the topic, laymen like us should not try to confuse ourselves with
it before reading up on and fully understanding the academic trail on the
topic.


> Also, for the newer revision of Intel's specification, “P6. In a
> multiprocessor system, stores to the same location have a total order”
> has been replaced by: “Any two stores are seen in a consistent order by
> processors other than those performing the stores.”

I think that provision is to allow store forwarding, as I described in
my original reply with Dekker idiom. MFENCE should bring the total store
order guarantees back even for the "writer" processors, since as per
Intel SDM:

"This serializing operation guarantees that every load and store
instruction that precedes the MFENCE instruction in program order
becomes *globally visible* before any load or store instruction that
follows the MFENCE instruction."

In other words, the relaxation for "writer" processors applies to regular
TSO, not to the case where you are using fences to force the order.


> So here's a consistent order seen by all the processors except those
> running the two writing threads:
>
> Wv0 -> T3 -> T6 -> T9 -> Wv1
>
> This also respects the total ordering for each individual site, and a
> total ordering of each individual processor's stores. The "reading"
> thread inserts its Rv0 between T9 and Wv1.

I don't understand this example. If Wv1 is serialized with fence, and
there is a read from "t" following it, that read cannot be Rwt6 then, as
it should see t=9. That pretty much deconstructs the original "execution".


-Aleksey.



Re: Enforcing total sync order on modern hardware

Vitaly Davidovich

I suspect that part of the confusion is caused by Intel talking about locked instructions as having a total order.  I'm not sure why they call them out like this, but I suspect it's because the CPUs have to arbitrate over who acquires the cacheline in exclusive mode if this happens to run at exactly the same time.  Otherwise, I don't see it as any different from the normal cache coherence mechanics that prevent multiple CPUs from performing an update to the same line concurrently.

Also, total order means each CPU's stores appear in program (and the same) order to all other CPUs (modulo store-buffer forwarding).  But to me this means there are multiple ordered streams proceeding concurrently - everyone observing these stores sees them at the same time (if read at the same time) and in the same order.  But the system isn't moving in some wall-clock lock-step manner unless code uses happens-before conditions to make progress by observing phase changes.

sent from my phone


Re: Enforcing total sync order on modern hardware

Marko Topolnik-2
In reply to this post by Aleksey Shipilev-2
On Tue, Mar 17, 2015 at 3:56 PM, Aleksey Shipilev <[hidden email]> wrote:
>
>
> I offered the simplistic mental model where fences expose the values to
> global coherency in instruction order, and thus regain the total order.
> When you say "wouldn't necessarily happen on a unique global timescale"
> or "due to distributed nature of coherence", you seem to be implying
> some other model, but that definition is too vague to be useful. Indeed,
> I can come up with an arbitrarily weak model that allows pretty much
> anything.


I am trying to work my way starting from specified guarantees, taking care not to introduce tacit assumptions such as the transitivity of separately guaranteed total orders. But yes, at some point empirical evidence takes over the documented guarantees.
 
> These intricate details on hardware concurrency is one of the reasons we
> have JMM. JMM explicitly requires the total order on synchronization
> actions, and JVM implementors are responsible to figure out how to map
> it back on hardware, let it be fences or locked instructions or what.

Sure, that's exactly what I was asking about: how does JMM efficiently translate into hardware given the ever more distributed nature of the HW architecture.
 
> > Also, for the newer revision of Intel's specification, “P6. In a
> > multiprocessor system, stores to the same location have a total order”
> > has been replaced by: “Any two stores are seen in a consistent order by
> > processors other than those performing the stores.”
>
> I think that provision is to allow store forwarding, as I described in
> my original reply with Dekker idiom. MFENCE should bring the total store
> order guarantees back even for the "writer" processors

Note that the older phrase talked only about the _same_ location, therefore not applying to IRIW.
 

> > So here's a consistent order seen by all the processors except those
> > running the two writing threads:
> >
> > Wv0 -> T3 -> T6 -> T9 -> Wv1
> >
> > This also respects the total ordering for each individual site, and a
> > total ordering of each individual processor's stores. The "reading"
> > thread inserts its Rv0 between T9 and Wv1.
>
> I don't understand this example. If Wv1 is serialized with fence, and
> there is a read from "t" following it, that read cannot be Rwt6 then, as
> it should see t=9. That pretty much deconstructs the original "execution".

I didn't see the need to imply global serialization: I maintained the order of writes by each thread and the total order of actions on a single location. Only a global ordering between reads of one location and writes to another location forces Wv1 to be sandwiched between T3 and T6. Apparently, though, MFENCE does force it.

-Marko



Re: Enforcing total sync order on modern hardware

Aleksey Shipilev-2
On 03/17/2015 06:57 PM, Marko Topolnik wrote:
> On Tue, Mar 17, 2015 at 3:56 PM, Aleksey Shipilev
> <[hidden email] <mailto:[hidden email]>> wrote:
> I am trying to work my way starting from specified guarantees, taking
> care not to introduce tacit assumptions such as the transitivity of
> separately guaranteed total orders. But yes, at some point empirical
> evidence takes over the documented guarantees.

If you are looking for documented guarantees that include hardware
memory consistency, the full JMM consistency rules, and explain any
phenomena you throw at it, then you are in for a
<strike>surprise</strike> long haul!



>> These intricate details of hardware concurrency are one of the reasons we
>> have the JMM. The JMM explicitly requires a total order on synchronization
>> actions, and JVM implementors are responsible for figuring out how to map
>> it back onto hardware, be it with fences, locked instructions, or whatever.
>
> Sure, that's exactly what I was asking about: how does the JMM translate
> efficiently onto hardware, given the ever more distributed nature of the HW
> architecture?

So, big picture view: hardware does provide totally ordered (sequentially
consistent) operations: locked instructions and/or fences on x86, hwsync
etc. on POWER, dmb and ll/sc on ARM, and so on. JMM implementations, as
well as C/C++11 implementations, map their totally ordered sync ops almost
directly to those hardware primitives.
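
As a concrete (and deliberately simplified) sketch of that mapping, here is roughly what a HotSpot-style JVM does for volatile accesses on x86; the exact instruction selection varies by JVM version, platform and JIT, so treat this as an illustration rather than a specification:

class VolatileMapping {
    volatile long v;

    void write(long x) {
        v = x;        // volatile store: a plain MOV, then a StoreLoad barrier,
                      // typically 'lock addl $0, (%rsp)' or MFENCE on x86
    }

    long read() {
        return v;     // volatile load: a plain MOV on x86, since TSO already
                      // gives the required load ordering; weaker machines
                      // (POWER, ARM) need explicit barriers here as well
    }
}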

How to maintain that total ordering is a hardware problem. A simple mental
model that fits a simplified message-passing coherency protocol is
relatively easy to construct. It deliberately does not talk about intricate
topologies and other bells and whistles, because the original question was
about maintaining the total order in the absence of a globally lockable
shared bus.

Thanks,
-Aleksey




Re: Enforcing total sync order on modern hardware

Stephan Diestelhorst
In reply to this post by Marko Topolnik-2
On Tuesday, 17 March 2015 at 14:39:32, Marko Topolnik wrote:

> On Tue, Mar 17, 2015 at 11:46 AM, Aleksey Shipilev <
> [hidden email]> wrote:
>
> > If "sharedVar" is also volatile (sequentially consistent), then Wv1
> > would complete before reading Rwt6.
>
>
> OK, but this wouldn't necessarily happen on a unique global timescale: the
> "writing" thread would have the ordering Wv1 -> Rwt6; there would be an
> _independent_ total order of actions on currentTime, and a third, again
> independent order of actions by the "reading" thread. Due to the
> distributed nature of coherence the fact that, on one core, Wv1 precedes
> Rwt6 does not enforce Rrt6 -> Rv1 on another core. It is not obvious that
> there is transitivity between these individual orders.
>
> Particularly note this statement in
> http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf:
>
> "[the CPU vendor specifications] admit the IRIW behaviour above but, under
> reasonable assumptions on the strongest x86 memory barrier, MFENCE, adding
> MFENCEs would not suffice to recover sequential consistency (instead, one
> would have to make liberal use of x86 LOCK’d instructions). Here the
> specifications seem to be much looser than the behaviour of implemented
> processors: to the best of our knowledge, and following some testing, IRIW
> is not observable in practice, even without MFENCEs. It appears that some
> JVM implementations depend on this fact, and would not be correct if one
> assumed only the IWP/AMD3.14/x86-CC architecture."
>
> Also, for the newer revision of Intel's specification, “P6. In a
> multiprocessor system, stores to the same location have a total order” has
> been replaced by: “Any two stores are seen in a consistent order by
> processors other than those performing the stores.”

IRIW is ruled out by both AMD and Intel.  Not a problem.
Weak architectures, such as ARM, talk about multi-copy atomicity, which
is what you are after.  On these architectures, fences (DMBs in the ARM
case) do restore global order through an elaborate construction of who
saw what etc.

Nothing to see here, I presume (unless you were talking about a real
clock, such as the TSC on x86.  But that has interesting semantics and I
will not feed the trolls.  Some naive notes:
http://rp-www.cs.usyd.edu.au/~gramoli/events/wttm4/papers/diestelhorst.pdf )

Thanks,
  Stephan



Re: Enforcing total sync order on modern hardware

oleksandr otenko
> IRIW is ruled out by both AMD and Intel.  Not a problem.

Does this "ruled out" mean "the IRIW results forbidden by JMM will not
be observed", or does this mean "the IRIW results forbidden by JMM are
not supported as not a problem"?

Alex



On 17/03/2015 21:42, Stephan Diestelhorst wrote:

> On Tuesday, 17 March 2015 at 14:39:32, Marko Topolnik wrote:
>> On Tue, Mar 17, 2015 at 11:46 AM, Aleksey Shipilev <
>> [hidden email]> wrote:
>>
>>> If "sharedVar" is also volatile (sequentially consistent), then Wv1
>>> would complete before reading Rwt6.
>>
>> OK, but this wouldn't necessarily happen on a unique global timescale: the
>> "writing" thread would have the ordering Wv1 -> Rwt6; there would be an
>> _independent_ total order of actions on currentTime, and a third, again
>> independent order of actions by the "reading" thread. Due to the
>> distributed nature of coherence the fact that, on one core, Wv1 precedes
>> Rwt6 does not enforce Rrt6 -> Rv1 on another core. It is not obvious that
>> there is transitivity between these individual orders.
>>
>> Particularly note this statement in
>> http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf:
>>
>> "[the CPU vendor specifications] admit the IRIW behaviour above but, under
>> reasonable assumptions on the strongest x86 memory barrier, MFENCE, adding
>> MFENCEs would not suffice to recover sequential consistency (instead, one
>> would have to make liberal use of x86 LOCK’d instructions). Here the
>> specifications seem to be much looser than the behaviour of implemented
>> processors: to the best of our knowledge, and following some testing, IRIW
>> is not observable in practice, even without MFENCEs. It appears that some
>> JVM implementations depend on this fact, and would not be correct if one
>> assumed only the IWP/AMD3.14/x86-CC architecture."
>>
>> Also, for the newer revision of Intel's specification, “P6. In a
>> multiprocessor system, stores to the same location have a total order” has
>> been replaced by: “Any two stores are seen in a consistent order by
>> processors other than those performing the stores.”
> IRIW is ruled out by both AMD and Intel.  Not a problem.
> Weak architectures, such as ARM, talk about multi-copy atomicity, which
> is what you are after.  On these architectures, fences (DMBs in the ARM
> case) do restore global order through an elaborate construction of who
> saw what etc.
>
> Nothing to see here, I presume (unless you were talking about a real
> clock, such as the TSC on x86.  But that has interesting semantics and I
> will not feed the trolls.  Some naive notes:
> http://rp-www.cs.usyd.edu.au/~gramoli/events/wttm4/papers/diestelhorst.pdf )
>
> Thanks,
>    Stephan
>


Re: Enforcing total sync order on modern hardware

Marko Topolnik-2
In reply to this post by Stephan Diestelhorst

On Tue, Mar 17, 2015 at 10:42 PM, Stephan Diestelhorst <[hidden email]> wrote:

IRIW is ruled out by both AMD and Intel.  Not a problem.

My scenario is not quite IRIW. There's one pure writer, one pure reader, and one reader-writer. There are no two reading processors that conflict in their observation of the store order. For the reader-writer, its stores and loads are to independent locations, so without a fence the visibility of its store may be postponed and the processor is allowed to observe its own stores sooner than all other processors do.

It seems that the JVM will emit such a fence after a volatile store, which ensures that the store has become visible to everyone before any further loads are attempted. That would cause the Rwt6 load to impose an indirect happens-before ordering on the Rrt9 load performed by the reading processor, transitively forcing Wv1 to happen before Rv0 and preventing that load from observing the old value of 0.
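
Here is a minimal sketch of that coupling in Java (field, method and value names are only illustrative, mirroring the labels in the diagram; both fields are volatile as in the original example):

class TotalSyncOrderSketch {
    volatile long currentTime;  // written only by the clock thread (T3, T6, T9)
    volatile int sharedVar;     // written by the writer, read by the reader

    void clock() {
        currentTime = 9;        // T9: the pure writer of currentTime
    }

    void writer() {
        sharedVar = 1;          // Wv1: volatile store; the JVM emits a
                                // StoreLoad barrier right after it
        long t = currentTime;   // Rwt6: cannot start before the barrier
                                // completes, so Wv1 is globally visible by now
    }

    void reader() {
        long t = currentTime;   // Rrt9: suppose this observes t = 9
        int v = sharedVar;      // with the writer's barrier in place, the
                                // total sync order forbids seeing the stale 0
    }
}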

On the other hand, the compiler might be tempted to optimize by eliminating some of the fences in the case of independent stores and loads. In our example, relying on TSO alone would be enough to satisfy happens-before consistency for the actions performed by the reading-writing processor. Studying Aleksey's "Nanotrusting the Nanotime" we find that doing some navel-gazing CPU work after a volatile write eliminates the volatile overheads, indicating that computation was able to continue without waiting for the store to complete. It's just that the total sync order would be very difficult to maintain if that computation involved volatile reads (even if they were independent).
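
A rough JMH-style sketch of that kind of experiment (the benchmark names and the amount of back-off work are made up here, not taken from the article):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class VolatileWriteBackoff {
    volatile long v;

    @Benchmark
    public void backToBack() {
        v = 1;                    // pays the full cost of the StoreLoad barrier
    }

    @Benchmark
    public void withBackoff() {
        v = 1;
        Blackhole.consumeCPU(10); // independent CPU work after the store; with
                                  // enough of it the measured per-write cost
                                  // largely disappears, as described above
    }
}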
 
Weak architectures, such as ARM, talk about multi-copy atomicity, which
is what you are after.  On these architectures, fences (DMBs in the ARM
case) do restore global order through an elaborate construction of who
saw what etc.

A naive idea would be to CAS a single global memory location on each sync operation, allowing everything to piggyback on the total ordering of the CAS operations on that location. I guess in reality there are smarter tricks :)
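
Purely for illustration, that naive scheme might look like this (nothing real works this way, and contending on a single cache line is the obvious problem):

import java.util.concurrent.atomic.AtomicLong;

// Naive sketch: every synchronization action also CASes one global counter,
// so all sync actions piggyback on the total order of successful CASes on
// that single location.
class GlobalSyncTicket {
    static final AtomicLong ORDER = new AtomicLong();

    static long nextTicket() {
        long t;
        do {
            t = ORDER.get();
        } while (!ORDER.compareAndSet(t, t + 1)); // the globally ordered CAS
        return t;                                 // position in the total order
    }
}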

--
Marko
