Mostly Thread Local

Mostly Thread Local

JSR166 Concurrency mailing list
Hi,

I am looking for a synchronization pattern where there is a single writer and 0-1 readers.  Writes happen much more frequently than reads, and in fact reads may never happen.

The use case is a simple logger which writes to a thread-local ring buffer, overwriting stale entries.  The reader comes in occasionally to read the buffer, but is in no rush. I am wondering if there is a way to make the writer lock-free or wait-free, while putting all the synchronization burden on the reader.

My current approach is a ConcurrentHashMap whose keys and values are WeakRefs, each pointing to a ThreadLocal.   Writes and reads on the ThreadLocal are done using plain old `synchronized`, which is mostly uncontended.
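
A minimal sketch of that setup; RingBuffer, its capacity, and the class names are stand-ins, not from the post above:

import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;

final class RingBuffer {
    private final Object[] slots;
    private int at;
    RingBuffer(int capacity) { slots = new Object[capacity]; }
    void add(Object e) { slots[at] = e; at = (at + 1) % slots.length; }
}

final class LogRegistry {
    // Weak keys/values so buffers of dead threads can be collected.
    private final ConcurrentHashMap<WeakReference<Thread>, WeakReference<RingBuffer>> live =
            new ConcurrentHashMap<>();
    private final ThreadLocal<RingBuffer> local = ThreadLocal.withInitial(() -> {
        RingBuffer buf = new RingBuffer(1024);
        live.put(new WeakReference<>(Thread.currentThread()), new WeakReference<>(buf));
        return buf;
    });

    void write(Object entry) {
        RingBuffer buf = local.get();
        synchronized (buf) {          // mostly uncontended; the reader shows up rarely
            buf.add(entry);
        }
    }
}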

Questions:

1.  Is there a faster way to implement write-heavy, single-producer code?

2.  If the writes happen in quick succession, will lock coarsening reduce the number of synchronization points?   (i.e. when using `synchronized`, is it possible to do better than volatile reads?)

3.  Is there a "lease"-like synchronization pattern where a thread can request access to data for a limited time without synchronizing each modification?  I was thinking of maybe using System.nanoTime() with some safety bounds, since I have to make the nanoTime call anyway.

Re: Mostly Thread Local

JSR166 Concurrency mailing list
Does the reader read the entire ring buffer? Or does it track some sort of cursor and try to read only newly inserted items?

If reading the entire ring buffer at once, I can think of a couple of ways that I believe would work:

Allocate entry with each write
If you allocate a new record every time you write an entry into the buffer, you can use VarHandles or AtomicReferenceArray to do volatile reads of each entry. The writer must of course use a volatile write for each store.
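
A sketch of this variant; LogEntry, SIZE, and the class shape are illustrative:

import java.util.concurrent.atomic.AtomicReferenceArray;

record LogEntry(long id, String message) {}

final class AllocatingRing {
    static final int SIZE = 1024;
    private final AtomicReferenceArray<LogEntry> buf = new AtomicReferenceArray<>(SIZE);
    private long writeIdx;                        // touched only by the single writer

    void write(LogEntry e) {                      // allocate a fresh entry per write...
        buf.set((int) (writeIdx++ % SIZE), e);    // ...and publish it: set() is a volatile write
    }

    LogEntry read(int slot) {
        return buf.get(slot);                     // volatile read: entry is safely published
    }
}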

Don't allocate new entries
If you want to avoid allocating entries and instead mutate pre-allocated objects in the buffer, you can use a stamp field on each entry. Since there is only one writer, you don't need extra synchronization around read-modify-write sequences on the stamp; you just need to end the sequence with a volatile write. So the writer reads the stamp (call this value the "initial stamp"), then negates it (or bitwise-inverts it) and volatile-writes the new value. (The stamp's sign bit is thus a dirty flag, indicating that the writer is concurrently making changes.) The writer then updates all of the fields in the entry (no need for volatile writes). Finally, it volatile-writes the stamp to "initial stamp plus one".

Readers need to (volatile) read the stamp at the start, then read all of the relevant fields (plain reads, not volatile ones), then (volatile) re-read the stamp at the end. If the stamp changed between the two reads, the field values read may be garbage/inconsistent, so the reader must try again, looping until the read is successful (same clean stamp value before and after).
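
A sketch of one entry under this scheme. The two fences are an addition beyond the description above, hedging against plain stores/loads drifting past the volatile stamp accesses; field names are illustrative:

import java.lang.invoke.VarHandle;

final class Entry {
    private volatile long stamp;                 // negative while the writer is mid-update
    private long a, b;                           // payload, plain fields

    void write(long newA, long newB) {           // single writer only
        long s = stamp;
        stamp = ~s;                              // volatile write: mark dirty
        VarHandle.storeStoreFence();             // keep payload stores below the dirty mark
        a = newA;                                // plain writes
        b = newB;
        stamp = s + 1;                           // volatile write: publish a clean stamp
    }

    long[] read() {
        while (true) {
            long before = stamp;                 // volatile read
            long x = a, y = b;                   // plain reads
            VarHandle.loadLoadFence();           // keep payload loads above the re-read
            if (before >= 0 && stamp == before) {
                return new long[] { x, y };      // stable snapshot
            }
            Thread.onSpinWait();                 // writer was active; retry
        }
    }
}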

The writer is non-blocking and non-locking. Readers are non-locking (but not necessarily non-blocking, since they may need to perform a nondeterministic number of re-reads). If the critical section is not short, readers will busy-wait for the writer to finish modifying an entry (so they should likely Thread.yield() periodically, perhaps even after every failed read attempt).

If you wanted to further reduce volatile writes, you could make a sharded ring buffer: break the buffer up into N buffers. The writer round-robins across shards. Then you could put the stamp on the shard, not on individual entries. This would allow the writer to batch up writes by dirtying the stamp on a shard, recording multiple entries (which do not need volatile writes), and then setting the new stamp value. This means longer critical sections, though, so potentially more wasted busy-wait cycles in readers. (Introducing a signaling mechanism seems like it would greatly complicate the scheme and may require introducing other synchronizers, which seems to defeat the point.)
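
A sketch of the sharded variant's writer; readers would validate the per-shard stamp exactly as in the entry sketch above. Shard counts and sizes are arbitrary:

import java.lang.invoke.VarHandle;

final class ShardedRing {
    static final int SHARDS = 4, PER_SHARD = 256;

    static final class Shard {
        volatile long stamp;                     // negative while dirty
        final Object[] entries = new Object[PER_SHARD];
        int next;                                // writer-private cursor
    }

    private final Shard[] shards = new Shard[SHARDS];
    private int rr;                              // writer-private round-robin index

    ShardedRing() {
        for (int i = 0; i < SHARDS; i++) shards[i] = new Shard();
    }

    void writeBatch(Object[] batch) {            // single writer only
        Shard s = shards[rr = (rr + 1) % SHARDS];
        long st = s.stamp;
        s.stamp = ~st;                           // dirty the whole shard
        VarHandle.storeStoreFence();             // keep the batch stores below the dirty mark
        for (Object e : batch) {
            s.entries[s.next] = e;               // plain stores: no per-entry barrier
            s.next = (s.next + 1) % PER_SHARD;
        }
        s.stamp = st + 1;                        // one volatile write publishes the batch
    }
}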



For both of the above cases, after the reader scans the entire buffer, it would need to re-order the results based on something (like a monotonic counter/ID on each row, recorded by the writer) to reconstruct the correct order. There is the possibility, of course, that the reader misses some items or even ends up with "holes" in the sequence of entries it reads.

----
Josh Humphries
[hidden email]


Re: Mostly Thread Local

JSR166 Concurrency mailing list
Thanks for the ideas.   I put together a proof of concept where the writer writes to a ring buffer but publishes the write index after each write.  I am not sure it's thread-safe:

WriterThread:
int i = (int) idxHandle.get(this);    // plain read: only the writer touches idx
i %= SIZE;
buf[i] = val;                         // plain store of the entry
idxHandle.setRelease(this, i + 1);    // publish the new write index

ReaderThread:
T[] copyBuf = (T[]) new Object[SIZE]; // generic arrays can't be created directly
int start = (int) idxHandle.getVolatile(this);
System.arraycopy(buf, 0, copyBuf, 0, SIZE);
int end = (int) idxHandle.getVolatile(this);


I know this is racy, but I *think* it may be okay.   As long as the reader ignores the elements between start and end (and assuming fewer than SIZE writes happened during the copy), it seems like the other elements copied from buf are safely published.
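
For reference, a sketch of the declarations the snippet assumes (the class name and SIZE are illustrative):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class ThreadLog {
    static final int SIZE = 16;
    static final VarHandle idxHandle;
    static {
        try {
            idxHandle = MethodHandles.lookup()
                    .findVarHandle(ThreadLog.class, "idx", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    int idx;                                  // accessed only through idxHandle
    final Object[] buf = new Object[SIZE];
}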


Re: Mostly Thread Local

JSR166 Concurrency mailing list
I would use a JCTools queue; MpscGrowableArrayQueue has been excellent for my case. I use it as a write buffer to replay events under a lock, without blocking writers. Since it is bounded, when full I fall back to blocking the producers, which improves throughput compared to busy waiting. For lossy events, I use a fixed ring buffer and a fork of Doug Lea's Striped64 to add buffers as contention occurs. In that case, if full, I let events be dropped, schedule the drain, and do that work under a lock.
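
A sketch of the bounded write-buffer half of that; the Event type, capacities, and full-queue fallback policy are illustrative:

import java.util.function.Consumer;
import org.jctools.queues.MpscGrowableArrayQueue;

record Event(long id, String message) {}

final class WriteBuffer {
    private final MpscGrowableArrayQueue<Event> queue = new MpscGrowableArrayQueue<>(64, 4096);
    private final Object drainLock = new Object();

    void record(Event e, Consumer<Event> sink) {
        while (!queue.offer(e)) {     // bounded: when full, the producer blocks...
            drain(sink);              // ...on the lock and helps drain
        }
    }

    void drain(Consumer<Event> sink) {
        synchronized (drainLock) {    // replay events under the lock
            queue.drain(sink::accept);
        }
    }
}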


Re: Mostly Thread Local

JSR166 Concurrency mailing list
You may want to look at my WriterReaderPhaser, which was designed for the exact pattern you describe: writer(s) you care about, reader(s) that are "not in a hurry", but needing synchronization on the data to avoid lossiness, double-counting, etc. It is wait-free for writers and blocking for readers (but with guaranteed forward progress).
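
A sketch of the writer side against HdrHistogram's org.HdrHistogram.WriterReaderPhaser; the PhasedLog class and its buffers are illustrative:

import org.HdrHistogram.WriterReaderPhaser;

final class PhasedLog {
    final WriterReaderPhaser phaser = new WriterReaderPhaser();
    volatile Object[] active = new Object[1024];   // writers always target this buffer
    Object[] inactive = new Object[1024];          // reader-owned, at rest

    void write(int slot, Object entry) {           // wait-free writer
        long v = phaser.writerCriticalSectionEnter();
        try {
            active[slot] = entry;                  // plain store into the active buffer
        } finally {
            phaser.writerCriticalSectionExit(v);
        }
    }
}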

It's pretty well explained in:

Some Implementations:

— Gil.


Re: Mostly Thread Local

JSR166 Concurrency mailing list
No, this is totally broken.

You need more barriers after the arraycopy and before end = ...getVolatile.

setRelease is also suspicious; it is not obvious it is doing what you want.

Alex


Re: Mostly Thread Local

JSR166 Concurrency mailing list
@Gil: I had seen that before, though I think it is optimized for the multi-producer case.  The issue with it is that the reader needs to keep track of the old snapshots after it completes reading, because the writer will have moved on.  I guess that's okay, but then the writer has to synchronize before each write.

@Alex:  Can you explain why?  I'm trying to understand the flaw in my reasoning.  From my PoV, as long as System.arraycopy copies the data with (Release|Opaque|Volatile) semantics, it should be sufficient.  My thought process:

Suppose the buffer is size 16, and the write idx is at 20.

T1(writer):  Read idx
T1: Write buf[idx&0xF]
T1: Write release idx + 1

T2(reader): Read volatile idx
T2: Start arraycopy of buf

T1:  Read idx
T1: Write buf[idx&0xF]
T1: Write release idx + 1

T2: finish array copy
T2: Read volatile idx


From T2's perspective, idx2 - idx1 would be how many entries after idx1 are racy garbage values.  All the other values, [0-3] and [5-15], were correctly synchronized by the release write and volatile read of idx.

Re: Mostly Thread Local

JSR166 Concurrency mailing list
end = ...getVolatile is allowed to go ahead of any or all of the arraycopy. So it's giving you bogus boundaries, and you can't guarantee the decision based on it is correct.

Then setRelease guarantees no reordering with the preceding loads and stores, but doesn't seem to guarantee no reordering with subsequent stores, e.g. buf[i+1]. So you can't guarantee anything at all.

So you want a store-store barrier before and after buf[i] = val.
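
A sketch of that placement, as methods on the ThreadLog sketch from earlier; note the trailing store-store before the index store is supplied by setRelease itself:

import java.lang.invoke.VarHandle;

// Writer: the fence keeps buf[i] = val from floating above the previous
// index publication; setRelease keeps it from sinking below the next one.
void write(Object val) {
    int i = (int) idxHandle.get(this) % SIZE;
    VarHandle.storeStoreFence();
    buf[i] = val;
    idxHandle.setRelease(this, i + 1);
}

// Reader: the fence keeps the copy's plain loads from drifting below the
// re-read of the index; without it, end tells you nothing about the copy.
int[] bounds(Object[] copyBuf) {
    int start = (int) idxHandle.getVolatile(this);
    System.arraycopy(buf, 0, copyBuf, 0, SIZE);
    VarHandle.loadLoadFence();
    int end = (int) idxHandle.getVolatile(this);
    return new int[] { start, end };
}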

Alex


Re: Mostly Thread Local

JSR166 Concurrency mailing list


On Apr 29, 2019, at 2:49 PM, Carl Mastrangelo via Concurrency-interest <[hidden email]> wrote:

> @Gil: I had seen that before, though I think it is optimized for the multi-producer case.

I guess the way I see it, the multi-producer support is just a nice side-effect. Even with a single writer, you'd need that writer to indicate critical-section boundaries in some way (or coordinate in some other way, e.g. with atomically modified caret position trackers in your ring buffer) if it wants to remain wait-free while coordinating with a reader.

> The issue with it is the reader needs to keep track of the old snapshots after it completes reading, because the writer will have moved on.

The common pattern is a classic double-buffer. The writer(s) only ever interact with the "active" data structure, and the reader will typically have one "inactive" data structure that it is working on, knowing that the inactive data is at rest because no writer(s) are interacting with it. Efficient implementations will "flip" the active and inactive versions and never allocate a new structure.

You can use other patterns (e.g. multi-interval buffers used in moving window applications), but the double-buffer is by far the simplest and most common.

In your use case, you would just have active and inactive instances of your ring buffer, with the reader controlling which is which (and flipping between them to read), and the writer (the logger in your case) only ever writing to the active one. If you size your buffer large enough and the reader visits often enough (for the buffer size and write rate), you can maintain lossless reading.
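
Continuing the PhasedLog sketch from earlier, a reader flip might look like this (still a sketch, not code from any library):

// Reader side: flip the buffers, then wait out in-flight writers.
Object[] snapshot() {
    phaser.readerLock();
    try {
        java.util.Arrays.fill(inactive, null);   // reuse the buffer; never reallocate
        Object[] tmp = inactive;
        inactive = active;
        active = tmp;                            // volatile write: writers now land in the other buffer
        phaser.flipPhase();                      // blocks until writers drain from the old buffer
        return inactive;                         // at rest: plain reads are safe
    } finally {
        phaser.readerUnlock();
    }
}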

Whether or not you need allocation for the actual elements you are logging is up to you and what you are logging. You can certainly have each of the two ring buffer instances be an array of references to e.g. String, with String entries instantiated and allocated per logging event, but zero-allocation recorders or loggers with this pattern are common, where each of the two instances is pre-allocated to hold all the data needed. E.g. if a reasonable cap on logging entry size is known, all entries in the ring buffer can be pre-allocated.

> I guess that's okay, but then the writer has to synchronize before each write.

The synchronization cost on the writer side is a single atomic increment on each side of a critical section. This (with a simple array-based ring buffer, and non-atomic manipulation of the "active caret" in that buffer since you have only a single writer) is probably much cheaper than using ConcurrentHashMap, ThreadLocal lookups, and WeakRef manipulations.

As for "coarsening": the critical section can have as many data operations folded together in a "coarsening effect" as you want.


Re: Mostly Thread Local

JSR166 Concurrency mailing list
This conflicts with my understanding of volatile, which is that memory accesses cannot be moved across it.  I thought it was the same as an acquire plus a release.


On Mon, Apr 29, 2019 at 2:55 PM Alex Otenko <[hidden email]> wrote:
end=...getVolatile is allowed to go ahead of any or all of arraycopy. So it's giving you bogus boundaries, and you can't guarantee the decision based on it is correct.

Then setRelease guarantees not reordering with the preceding loads and stores, but doesn't seem to guarantee not reordering with subsequent stores, eg buf[i+1]. So you can't guarantee anything at all.

So you want store-store before and after buf[i]=val

Alex

On Mon, 29 Apr 2019, 22:50 Carl Mastrangelo, <[hidden email]> wrote:
@Gil: I had seen that before, though I think it is optimized for the multiproducer case.  The issue with it is the reader needs to keep track of the old snapshots after it completes reading, because the writer will have moved on.  I guess thats okay, but then the writer has to syncrhonize before each write.

@Alex:  Can you explain why?  I'm trying to understand the flaw in my reasoning.  From my PoV, as long as System.arraycopy copies the data in (Release|Opaque|Volatile), it should be sufficient.  My thought process:

Suppose the buffer is size 16, and the write idx is at 20.

T1(writer):  Read idx
T1: Write buf[idx&0xF]
T1: Write release idx + 1

T2(reader): Read volatile idx
T2: Start arraycopy of buf

T1:  Read idx
T1: Write buf[idx&0xF]
T1: Write release idx + 1

T2: finish array copy
T2: Read volatile idx


From T2's perspective, idx2 - idx1  would be how many entrees after idx1 are racy garbage values.  All other values, [0-3], [5-15]  were correctly synchronized by the released write and volatile read of idx.  




On Mon, Apr 29, 2019 at 2:16 PM Alex Otenko <[hidden email]> wrote:
No, this is totally broken.

You need more barriers after arraycopy and before end=....getVolatile

setRelease is also suspicious, not obvious it is doing what you want. 

Alex


On Mon, 29 Apr 2019, 18:48 Carl Mastrangelo via Concurrency-interest, <[hidden email]> wrote:
Thanks for the ideas.   I put together a proof of concept where the writer writes to a ring buffer, but publishes the write index after each write.  I am not sure it's threadsafe:

WriterThread:
i = idxHandle.get(this)
i %= SIZE;
buf[i] = val;
idxHandle.setRelease(this, i + 1);

ReaderThread:
T[] copyBuf = new T[SIZE];
start = idxHandle.getVolatile(this);
System.arraycopy(buf, 0, copyBuf, 0, SIZE);
end = idxHandle.getVolatile(this);


I know this is racy, but I *think* it may be okay.   As long as the reader ignores the elements between start and end (and assuming end - start > SIZE), it seems like the other elements copied from buf are safely published.


On Thu, Apr 25, 2019 at 2:21 PM Josh Humphries <[hidden email]> wrote:
Does the reader read the entire ring buffer? Or does it try track some sort of cursor and try to only read newly inserted items?

If reading the entire ring buffer at once, I can think of a couple of ways that I believe would work:

Allocate entry with each write
If you allocate a new record every time you write an entry into the buffer, you can use VarHandles or AtomicReferenceArray to do volatile reads of each entry. The writer must of course use a volatile write for each store.

Don't allocate new entries
If you're wanting to not allocate entries but just mutate existing objects in the buffer that get pre-allocated, you can use a stamp field on each entry. Since there is only one writer, you don't need extra synchronization around read-modify-write sequences on the stamp. You just need to end the sequence with a volatile write. So the writer reads the stamp (call this value "initial stamp"), then negates it (or bitwise-invert), and volatile writes the new value. (So the stamp's sign bit is a dirty flag, indicating that the writer is concurrently making changes). The writer then updates all of the fields in the entry (no need for volatile writes). Finally, it volatile writes the stamp to "initial stamp plus one".

Readers need to (volatile) read the stamp at the start, then read all of the relevant fields (which can be plain reads, not volatile ones), then (volatile) re-read the stamp at the end. If the stamp changed between the two reads, the field values read may be garbage/inconsistent, so the reader must try again. This would go into a loop until the read is successful (same stamp values before and after).

The writer is non-blocking and non-locking. Readers are non-locking (but not necessarily non-blocking since they may need to perform nondeterministic number of re-reads). If the critical section is not short, readers will be busy waiting for writers to finish modifying an entry (so they should likely Thread.yield() periodically, perhaps even after every failed read attempt).

If you wanted to further reduce volatile writes, you could make a sharded ring buffer: break the buffer up into N buffers. The writer round-robins across shards. Then you could put the stamp on the shard, not on individual entries. This would allow the writer to potentially batch up writes by dirtying the stamp on a shard, recording multiple entries (which do not need to use volatile writes), and then setting the new stamp value. This means longer critical sections, though, so more wasted busy-wait cycles in readers potentially. (Introducing a signaling mechanism seems like it would greatly complicate the scheme and may require introduction of other synchronizers, which seems to defeat the point.)



For both of the above cases, after the reader scans the entire buffer, it would need to re-order the results based on something (like a monotonic counter/ID on each row, recorded by the writer) to reconstruct the correct order. There is the possibility, of course, that the reader misses some items or even has "holes" in the sequence of entries it reads.

----
Josh Humphries
[hidden email]
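
For reference, a compact sketch of the per-entry stamp scheme described above (names are assumptions, and the two explicit fences are an addition to pin down orderings the prose description leaves implicit):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class StampedEntry {
    static final VarHandle STAMP;
    static {
        try {
            STAMP = MethodHandles.lookup()
                    .findVarHandle(StampedEntry.class, "stamp", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    long stamp;      // negative while dirty: the sign bit is the dirty flag
    long timestamp;  // example payload fields
    String message;

    // Single writer: mark dirty, mutate with plain stores, publish clean stamp.
    void write(long ts, String msg) {
        long s = (long) STAMP.get(this); // plain read: the single writer owns it
        STAMP.setVolatile(this, ~s);     // bitwise-invert: stamp goes negative
        VarHandle.storeStoreFence();     // keep the payload stores after the mark
        timestamp = ts;
        message = msg;
        STAMP.setRelease(this, s + 1);   // clean stamp: payload ordered before it
    }

    // Reader: retry until the stamp is clean and unchanged across the reads.
    String read() {
        while (true) {
            long before = (long) STAMP.getVolatile(this);
            long ts = timestamp;         // plain reads of the payload
            String msg = message;
            VarHandle.loadLoadFence();   // keep payload reads before the re-check
            long after = (long) STAMP.getVolatile(this);
            if (before >= 0 && before == after) {
                return ts + " " + msg;
            }
            Thread.yield();              // writer is mid-update; back off
        }
    }
}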



Re: Mostly Thread Local

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list


On Mon, Apr 29, 2019 at 3:28 PM Gil Tene <[hidden email]> wrote:


On Apr 29, 2019, at 2:49 PM, Carl Mastrangelo via Concurrency-interest <[hidden email]> wrote:

@Gil: I had seen that before, though I think it is optimized for the multiproducer case. 

I guess the way I see it, the multi-producer support is just a nice side-effect. Even with a single writer, you'd need that writer to indicate critical section boundaries in some way (or coordinate in some other way with e.g. atomically modified caret position trackers in your ring buffer) if it wants to remain wait-free while coordinating with a reader.

The issue with it is the reader needs to keep track of the old snapshots after it completes reading, because the writer will have moved on. 

The common pattern is a classic double-buffer. The writer(s) only ever interacts with the "active" data structures, and the reader will typically have one "inactive" data structure that it is working on, knowing that the inactive data is at rest because no writer(s) are interacting with it. Efficient implementations will "flip" the active and inactive versions and never allocate a new structure.

I am worried that this would affect the fast write path, since it would now have to re-verify that the array is not null and redo the bounds check for the write. Since the write is not done in a loop, I don't think the bounds check or null check can be easily eliminated, as they might be if the array were in a final field.

 

You can use other patterns (e.g. multi-interval buffers used in moving window applications), but the double-buffer is by far the simplest and most common.

In your use case, you would just have active and inactive instances of your ringbuffer, with the reader controlling which is which (and flipping between them to read), and the writer (the logger in your case) only ever writing to the active one. If you size your buffer large enough and the reader visits often enough (for the buffer size and write rate),  you can maintain lossless reading.

Whether or not you need allocation for the actual elements you are logging is up to you and what you are logging. You can certainly have each of the two ring buffer instances be an array of references to e.g. String, with String entries instantiated and allocated per logging event, but zero-allocation recorders or loggers with this pattern are common, where each of the two instances is pre-allocated to hold the entire data needed. E.g. if a reasonable cap for a logging entry size is known, all entries in the ring buffer can be pre-allocated.

I am logging strings, but they are preallocated constants. I'm confident there won't be any memory allocation (and I use JMH's GC profiler to verify this).
 

I guess that's okay, but then the writer has to synchronize before each write.

The synchronization cost on the writer side is a single atomic increment on each side of a critical section. This (with a simple array based ring buffer and non-atomic manipulation of the "active caret" in that buffer since you have only a single writer) is probably much cheaper than using ConcurrentHashMap, ThreadLocal lookups, and WeakRef manipulations.

I may not have been clear: the ConcurrentHashMap and WeakRef code is an index that lets the reader track down each thread's data. I think the ThreadLocal is still needed for contention-free writing.
 

As for "coarsening": the critical section can have as many data operations folded together in a "coarsening effect" as you want.



Re: Mostly Thread Local

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
Acquire/release is a confusing way of looking at things.

Volatile accesses are totally ordered only among themselves. Other accesses are not totally ordered with respect to volatile accesses - the threads do not agree on a single order in which they appear. In colloquial terms this means some reordering of normal accesses is allowed. In particular, a volatile read may appear to go ahead of any normal read or write as seen by some thread - other threads do not necessarily agree and may observe some other order.

Alex

On Tue, 30 Apr 2019, 00:35 Carl Mastrangelo, <[hidden email]> wrote:
This conflicts with my understanding of volatile, which is that memory accesses cannot be moved across it. I thought it was the same as an acquire+release.


On Mon, Apr 29, 2019 at 2:55 PM Alex Otenko <[hidden email]> wrote:
end = ...getVolatile is allowed to go ahead of any or all of the arraycopy. So it is giving you bogus boundaries, and you can't guarantee that a decision based on it is correct.

Then setRelease guarantees no reordering with the preceding loads and stores, but doesn't seem to guarantee no reordering with subsequent stores, e.g. buf[i+1]. So you can't guarantee anything at all.

So you want a store-store barrier before and after buf[i] = val.

Alex
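
In code, the barriers being asked for might sit like this, using the VarHandle fence methods (a sketch under the assumption that these are the intended fences; whether they are sufficient is exactly what the thread leaves open):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class FencedRing {
    static final int SIZE = 16, MASK = SIZE - 1;
    static final VarHandle IDX;
    static {
        try {
            IDX = MethodHandles.lookup()
                    .findVarHandle(FencedRing.class, "idx", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    final String[] buf = new String[SIZE];
    int idx;

    void write(String val) {
        int i = (int) IDX.get(this);
        VarHandle.storeStoreFence();  // "store-store before buf[i] = val": the
                                      // slot store cannot float above the
                                      // previous publication of idx
        buf[i & MASK] = val;
        IDX.setRelease(this, i + 1);  // the release is the store-store after:
                                      // the slot store is ordered before it
    }

    String[] read() {
        int start = (int) IDX.getVolatile(this);
        String[] copy = new String[SIZE];
        System.arraycopy(buf, 0, copy, 0, SIZE);
        VarHandle.loadLoadFence();    // the barrier "after arraycopy and before
                                      // end = ...": the copy's loads cannot
                                      // sink below the re-read of idx
        int end = (int) IDX.getVolatile(this);
        if (end - start >= SIZE) {
            return new String[SIZE];  // lapped: discard everything
        }
        for (int i = start; i < end; i++) {
            copy[i & MASK] = null;    // slots dirtied during the copy
        }
        return copy;
    }
}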


Re: Mostly Thread Local

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list


On Apr 29, 2019, at 4:46 PM, Carl Mastrangelo <[hidden email]> wrote:




I am worried that this would affect the fast write path, since it would now have to re-verify that the array is not null and redo the bounds check for the write. Since the write is not done in a loop, I don't think the bounds check or null check can be easily eliminated, as they might be if the array were in a final field.

Since this (seems to be) targeting Java:

a) Not that it's material anyway, but the null check is actually "free". Modern JVMs don't actually check for nulls in the hot path. Instead, on attempting to access through null references, they catch SEGV's, deopt, and throw the exception as if the check was there. So as long as you don't actually try to use a null array, you won't see any null check costs.

b) The bounds check would not be eliminated if the writes are not in a loop, but the bounds check cost is probably negligible compared to the rest of the work involved (e.g. compared to the two atomic increments in my scheme, or compared to atomic updates or whatever other synchronization mechanisms you would need to use in other schemes in order to prevent "funny" situations with the reader and the carets).


 


Re: Mostly Thread Local

JSR166 Concurrency mailing list
I tried out an implementation inspired by WriterReaderPhaser with my code, and was able to get the per-method-call time down to about 11ns for the two atomic increments. That is low, but the original snippet I provided (with the volatile read and lazy-set) brings the time down to about 3ns. If there is some way to keep that lower figure, it would be highly preferable; I want the synchronization overhead to be as low as possible.

Comments below on the checks:

On Mon, Apr 29, 2019 at 8:15 PM Gil Tene <[hidden email]> wrote:


Since this (seems to be) targeting Java:

a) Not that it's material anyway, but the null check is actually "free". Modern JVMs don't actually check for nulls in the hot path. Instead, on attempting to access through null references, they catch SEGV's, deopt, and throw the exception as if the check was there. So as long as you don't actually try to use a null array, you won't see any null check costs.

The free-ness of this is in question. I looked at the PrintAssembly output, and the null check does seem to come for free because the length of the array is read anyway. The problem is that that read isn't free, at least according to the percentages from perfasm: the mov instruction varied anywhere from 2-5% of the time of that 3ns duration. (It's low, I know, but it was one of the hotter spots.)
 

b) The bounds check would not be eliminated if the writes are not in a loop, but the bounds check cost is probably negligible compared to the rest of the work involved (e.g. compared to the two atomic increments in my scheme, or compared to atomic updates or whatever other synchronization mechanisms you would need to use in other schemes in order to prevent "funny" situations with the reader and the carets).

I thought compilers have some sort of prover phase, where known mathematical identities allow them to skip checks. For example:

private static final int SIZE = 16;
private static final int MASK = 0xF;
@SuppressWarnings("unchecked") // generic array creation needs the cast
private final T[] buf = (T[]) new Object[SIZE];

...
buf[i & MASK] = val;

In this example, the compiler provably knows 0 <= (i & MASK) < 16, it knows buf is not going to change, and the length of the array is constant. Even outside of a loop, it seems like an optimizing compiler should know this.


 


 


Re: Mostly Thread Local

JSR166 Concurrency mailing list
As a followup, I put together a JCStress test that checks for races in my example code above. I can't say there are no races in the code I proposed, but I wasn't able to find any, and I tested several deliberately off-by-one variants of the code to confirm that the harness does detect races when they exist. I only tested on my Skylake processor, so there may be races that x86 is hiding from me. I'll try to find a multicore ARM processor and run there too.

The code (30LoC) is here if anyone would like to look at it: 
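
Separately, for a sense of what such a test looks like, here is a hypothetical jcstress test of the same general shape (this is not the linked 30-line test; the class name, outcomes, and reduction to a single slot are all assumptions). It probes the release/acquire edge the scheme leans on:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import org.openjdk.jcstress.annotations.Actor;
import org.openjdk.jcstress.annotations.Expect;
import org.openjdk.jcstress.annotations.JCStressTest;
import org.openjdk.jcstress.annotations.Outcome;
import org.openjdk.jcstress.annotations.State;
import org.openjdk.jcstress.infra.results.II_Result;

@JCStressTest
@Outcome(id = "0, 0", expect = Expect.ACCEPTABLE, desc = "reader ran first")
@Outcome(id = "0, 1", expect = Expect.ACCEPTABLE, desc = "slot seen before idx")
@Outcome(id = "1, 1", expect = Expect.ACCEPTABLE, desc = "publish observed")
@Outcome(id = "1, 0", expect = Expect.FORBIDDEN, desc = "idx seen, slot stale")
@State
public class RingPublishTest {
    static final VarHandle IDX;
    static {
        try {
            IDX = MethodHandles.lookup()
                    .findVarHandle(RingPublishTest.class, "idx", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    int idx;
    int slot;

    @Actor
    public void writer() {
        slot = 1;                 // plain store into the "ring slot"
        IDX.setRelease(this, 1);  // publish the new write index
    }

    @Actor
    public void reader(II_Result r) {
        r.r1 = (int) IDX.getVolatile(this); // did we observe the publish?
        r.r2 = slot;                        // then read the slot contents
    }
}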




Re: Mostly Thread Local

JSR166 Concurrency mailing list
You are using methods that happen to produce correct results on x86.

All reads are performed in program order on x86, so as long as the compiler doesn't reorder them, the reorderings the JMM allows do not manifest themselves.

All writes are in a total order, too.

Alex

On Thu, 2 May 2019, 06:48 Carl Mastrangelo, <[hidden email]> wrote:
As a follow-up, I put together a JCStress test that checks for races in my example code above.  I can't say that there aren't any races in the code I proposed, but I wasn't able to find any, and I tested several off-by-one variants of the code to confirm that the harness does catch races where they are known to exist.  I only tested on my Skylake processor, so there are likely races that x86 is hiding from me.  I'll try finding a multicore ARM processor and run there too.

The code (30LoC) is here if anyone would like to look at it: 
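
Purely as an illustration of the shape such a harness takes (this is not the test linked above), a minimal jcstress test for the release/volatile publication at the heart of the scheme might look like:

import org.openjdk.jcstress.annotations.Actor;
import org.openjdk.jcstress.annotations.Expect;
import org.openjdk.jcstress.annotations.JCStressTest;
import org.openjdk.jcstress.annotations.Outcome;
import org.openjdk.jcstress.annotations.State;
import org.openjdk.jcstress.infra.results.II_Result;

@JCStressTest
@Outcome(id = "0, 0",  expect = Expect.ACCEPTABLE, desc = "Reader ran before the writer.")
@Outcome(id = "0, 42", expect = Expect.ACCEPTABLE, desc = "Writer ran between the two reads.")
@Outcome(id = "1, 42", expect = Expect.ACCEPTABLE, desc = "Reader saw the published write.")
@Outcome(id = "1, 0",  expect = Expect.FORBIDDEN,  desc = "Index visible but data not: publication broken.")
@State
public class ReleasePublicationTest {
    int data;           // plain field, published via the index
    volatile int idx;   // volatile stands in for the setRelease/getVolatile pair

    @Actor
    public void writer() {
        data = 42;  // plain store of the payload
        idx = 1;    // the release: publishes data
    }

    @Actor
    public void reader(II_Result r) {
        r.r1 = idx;  // the acquire
        r.r2 = data; // plain read; must be 42 whenever r1 == 1
    }
}

jcstress runs the two actors repeatedly on fresh @State instances and tallies which outcomes occur; the point of testing deliberately off-by-one variants is to check that the FORBIDDEN outcome really does show up when the publication order is wrong.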



Reply | Threaded
Open this post in threaded view
|

Re: Mostly Thread Local

JSR166 Concurrency mailing list
After reading some of the code in StampedLock I think I can propose something that addresses the reordering:

int[] data;
int idx;
VarHandle DATA;
VarHandle IDX;

Writer:
DATA.setOpaque(this, val);
IDX.setRelease(this, i + 1);
VarHandle.storeStoreFence();

Reader:
i1 = IDX.getAcquire(this);
VarHandle.loadLoadFence();
dataCopy = // Opaque copy of data
VarHandle.loadLoadFence();
i2 = IDX.getAcquire(this);

I think with the fences, the writer index store cannot be reordered incorrectly with the data store.  Also, the reader indexes cannot be reordered with the array copy.  The goal is to get as much of the data array as possible.  It's okay to lose some of the data between the start and end of the copy, as long as it's possible to be confident about the unchanged part of the data array.  

One thing is puzzling, though:  StampedLock has a similar pattern, but uses plain reads of the data.  I am not sure how this is valid, since I thought Opaque mode was needed for non-tearing reads (specifically per the "informal proof" section at the top of the StampedLock source).
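
Filled out into compilable Java, the proposal looks roughly like the sketch below.  DATA here is an array-element VarHandle (the snippet above elides the array index), the class and method names are mine, and the fence placement follows the proposal as posted, even where the reply further down notes a fence is unnecessary:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class FencedRing {
    static final int SIZE = 16;
    static final int MASK = SIZE - 1;

    final int[] data = new int[SIZE];
    int idx;  // accessed only through IDX

    static final VarHandle IDX;
    static final VarHandle DATA = MethodHandles.arrayElementVarHandle(int[].class);
    static {
        try {
            IDX = MethodHandles.lookup().findVarHandle(FencedRing.class, "idx", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Writer (single thread only).
    void write(int val) {
        int i = (int) IDX.get(this);          // plain read: only the writer mutates idx
        DATA.setOpaque(data, i & MASK, val);  // opaque store of the element
        IDX.setRelease(this, i + 1);          // publish the new index
        VarHandle.storeStoreFence();          // as proposed; arguably subsumed by setRelease
    }

    // Reader: snapshot everything, report the racy window to the caller.
    int[] read(int[] window) {
        int i1 = (int) IDX.getAcquire(this);
        VarHandle.loadLoadFence();            // as proposed; the reply below notes this one is redundant
        int[] copy = new int[SIZE];
        for (int j = 0; j < SIZE; j++) {
            copy[j] = (int) DATA.getOpaque(data, j);  // opaque reads: bitwise-atomic, not hoistable
        }
        VarHandle.loadLoadFence();            // keeps the re-read of idx after the copy
        int i2 = (int) IDX.getAcquire(this);
        window[0] = i1;                       // caller must distrust slots written in [i1, i2)
        window[1] = i2;
        return copy;
    }
}

The opaque element reads also speak to the hoisting worry raised below: unlike plain reads, they are guaranteed to be bitwise-atomic and cannot be collapsed into a single cached read.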





Reply | Threaded
Open this post in threaded view
|

Re: Mostly Thread Local

JSR166 Concurrency mailing list
Looks ok. You don't need any fences immediately after getAcquire.

With the fences as-is, the normal reads will happen in the right order, too. There only needs to be certainty that they will actually be materialized: I am not sure what guarantees that the compiler will not hoist the normal reads.
Alex

Reply | Threaded
Open this post in threaded view
|

Re: Mostly Thread Local

JSR166 Concurrency mailing list
In reply to this post by JSR166 Concurrency mailing list
StampedLock is super tricky.  
Consider re-reading the comments and Hans' paper.
Consider using StampedLock somehow instead of building another equally tricky synchronizer.
There were also previous discussions about fences in StampedLock on this list (or core-libs-dev).
Users of StampedLock must be resilient to bogus data in optimistic reads, including out-of-thin-air torn reads.
The data that were read only become valid after the stamp is validated.
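
For reference, the discipline described above is the optimistic-read pattern from the StampedLock javadoc: read under tryOptimisticRead(), treat every value as potentially bogus, and only trust the locals once validate() succeeds, falling back to a real read lock otherwise.  A condensed version of the javadoc's own example:

import java.util.concurrent.locks.StampedLock;

final class OptimisticPoint {
    private final StampedLock lock = new StampedLock();
    private double x, y;  // plain fields: optimistic readers may observe torn/bogus values

    void move(double dx, double dy) {
        long stamp = lock.writeLock();
        try {
            x += dx;
            y += dy;
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    double distanceFromOrigin() {
        long stamp = lock.tryOptimisticRead();
        double cx = x, cy = y;           // plain reads; not yet trustworthy
        if (!lock.validate(stamp)) {     // a writer intervened: discard and lock for real
            stamp = lock.readLock();
            try {
                cx = x;
                cy = y;
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return Math.sqrt(cx * cx + cy * cy);  // safe: cx/cy were validated or read under lock
    }
}

The key rule is to copy fields into locals and do nothing dangerous with them (no dereferencing, no array indexing with them) until validate() has returned true.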

On Tue, May 21, 2019 at 11:42 AM Carl Mastrangelo via Concurrency-interest <[hidden email]> wrote:
After reading some of the code in StampedLock I think I can propose something that addresses the reordering:

int[] data;
int idx;
VarHandle DATA;
VarHandle IDX;

Writer:
DATA.setOpaque(this, val);
IDX.setRelease(this, i + 1);
VarHandle.storeStoreFence();

Reader:
i1 = IDX.getAcquire(this);
VarHandle.loadLoadFence();
dataCopy = // Opaque copy of data
VarHandle.loadLoadFence();
i2 = IDX.getAcquire(this);

I think with the fences, the writer index store cannot be reordered incorrectly with the data store.  Also, the reader indexes cannot be reordered with the array copy.  The goal is to get as much of the data array as possible.  It's okay to lose some of the data between the start and end of the copy, as long as it's possible to be confident about the unchanged part of the data array.  

One thing is puzzling though:  StampedLock has a similar pattern, but uses plain reads of the data.  I am not sure how this is valid, since I thought Opaque level was needed for non tearing reads (specifically in the "informal proof" section at the top).   





On Thu, May 2, 2019 at 2:24 AM Alex Otenko <[hidden email]> wrote:
You are using methods that result in correct results on x86.

All reads are performed in program order on x86, so as long as the compiler doesn't reorder them, the effects JMM allows are not manifesting themselves. 

All writes are in total order, too.

Alex

On Thu, 2 May 2019, 06:48 Carl Mastrangelo, <[hidden email]> wrote:
As a followup, I put together a JCStress test that checks for races in my example code above.  I can't say that there aren't any races in the code I proposed, but I wasn't able to find any, and I tested several off-by-one variants of the code to prove that races do exist.  I only tested on my Skylake processor, so there are some races that I'm sure x86 is hiding from me.  I'll try finding a multicore ARM processor and run there too.

The code (30LoC) is here if anyone would like to look at it: 



On Tue, Apr 30, 2019 at 10:47 AM Carl Mastrangelo <[hidden email]> wrote:
I tried out an implementation inspired by WriterReaderPhaser with my code, and was able to get the the per method call time down to about 11ns for the two atomic increments.  This is low, but the original snippet I provided (with volatile read and lazy-set) brings the time to about 3ns.  If there is some way to keep this lower time, it would be highly preferable.   It would be good if the synchronization overhead is as low as possible.   

Comments below on the checks:

On Mon, Apr 29, 2019 at 8:15 PM Gil Tene <[hidden email]> wrote:


On Apr 29, 2019, at 4:46 PM, Carl Mastrangelo <[hidden email]> wrote:



On Mon, Apr 29, 2019 at 3:28 PM Gil Tene <[hidden email]> wrote:


On Apr 29, 2019, at 2:49 PM, Carl Mastrangelo via Concurrency-interest <[hidden email]> wrote:

@Gil: I had seen that before, though I think it is optimized for the multiproducer case. 

I guess the way I see it, the multi-producer support is just a nice side-effect. Even with a single writer, you'd need that writer to indicate critical section boundaries in so me way (or coordinate in some other way with e.g. atomically modified caret position trackers in your ring buffer) if it wants to remain wait-free while coordinating with a reader.

The issue with it is the reader needs to keep track of the old snapshots after it completes reading, because the writer will have moved on. 

The common pattern is a classic double-buffer. The writer(s) only ever interacts with the "active" data structures, and the reader will typically have one "inactive" data structure that it is working on, knowing that the inactive data is at rest because no writer(s) are interacting with it. Efficient implementations will "flip" the active and inactive versions and never allocate a new structure.

I am worried that this would affect the fast write path, since now it would have to reverify the array is not null, and reverify the bounds check for the write.  Since this write is not done in a loop, I don't think the bounds check or null check can be so easily eliminated, which it may be able to if the array was in a final variable.  

Since this (seems to be) targeting Java:

a) Not that it's material anyway, but the null check is actually "free". Modern JVMs don't actually check for nulls in the hot path. Instead, on attempting to access through null references, they catch SEGV's, deopt, and throw the exception as if the check was there. So as long as you don't actually try to use a null array, you won't see any null check costs.

The free-ness of this is in question.   I looked at the PrintAssembly output and it seems like the null check comes free because the length of the array is read. The problem is that it isn't free, at least according to the percentages from perfasm.  The mov instruction varied anywhere from 2-5% of the time of that 3ns duration.  (its low I know, but it was one of the hotter spots).
 

b) The bounds check would not be eliminated if the writes are not in a loop, but the bounds check cost is probably negligible compared to the rest of the work involved (e.g. compared to the two atomic increments in my scheme, or compared to atomic updates or whatever other synchronization mechanisms you would need to use in other schemes in order to prevent "funny" situations with the reader and the carets).

I thought compilers have some sort of Prover phase, where mathematical identities are known that allow it to skip checks.   For example:

private static final int SIZE = 16;
private static final int MASK = 0xF;
private final T[] buf = new T[SIZE];

...
buf[i & MASK] = val

In this example, the compiler provably knows 0 <= i & MASK < 16, it knows the buf is not going to change, and the length of the array is constant.   Even outside of a loop, it seems like optimizing compilers should know this.


 


 

You can use other patterns (e.g. multi-interval buffers used in moving window applications), but the double-buffer is by far the simplest and most common.

In your use case, you would just have active and inactive instances of your ringbuffer, with the reader controlling which is which (and flipping between them to read), and the writer (the logger in your case) only ever writing to the active one. If you size your buffer large enough and the reader visits often enough (for the buffer size and write rate),  you can maintain lossless reading.

Whether or not you need allocation for the actual elements you are logging is up to you and what you are logging. You can certainly have each of the two ring buffer instance be an array of references to e.g. String, with String entries instantiated and allocated per logging event, but zero allocation recorders or loggers with this pattern are common, where each of the two instances are pre-allocated to hold the entire data needed. E.g. if a reasonable cap for a logging entry size is known, all entries in the ring buffer can be pre-allocated.

I am logging strings, but they are preallocated constants.   I'm confident there won't be any memory allocation (and use JMH's gc profiler to check this).   
 

I guess thats okay, but then the writer has to syncrhonize before each write.

The synchronization cost on the writer side is a single atomic increment on each side of a critical section. This (with a simple array based ring buffer and non-atomic manipulation of the "active caret" in that buffer since you have only a single writer) is probably much cheaper than using ConcurrentHashMap, ThreadLocal lookups, and WeakRef manipulations.

I may not have been clear, the ConcurrentHashMap and Weakref code is an index for the reader to track down each thread's data.  I think the threadlocal is still needed for contention free writing.  
 

As for "coarsening": the critical section can have as many data operations folded together in a "coarsening effect" as you want.


@Alex:  Can you explain why?  I'm trying to understand the flaw in my reasoning.  From my PoV, as long as System.arraycopy copies the data in (Release|Opaque|Volatile), it should be sufficient.  My thought process:

Suppose the buffer is size 16, and the write idx is at 20.

T1(writer):  Read idx
T1: Write buf[idx&0xF]
T1: Write release idx + 1

T2(reader): Read volatile idx
T2: Start arraycopy of buf

T1:  Read idx
T1: Write buf[idx&0xF]
T1: Write release idx + 1

T2: finish array copy
T2: Read volatile idx


From T2's perspective, idx2 - idx1  would be how many entrees after idx1 are racy garbage values.  All other values, [0-3], [5-15]  were correctly synchronized by the released write and volatile read of idx.  




On Mon, Apr 29, 2019 at 2:16 PM Alex Otenko <[hidden email]> wrote:
No, this is totally broken.

You need more barriers after arraycopy and before end=....getVolatile

setRelease is also suspicious, not obvious it is doing what you want. 

Alex


On Mon, 29 Apr 2019, 18:48 Carl Mastrangelo via Concurrency-interest, <[hidden email]> wrote:
Thanks for the ideas.   I put together a proof of concept where the writer writes to a ring buffer, but publishes the write index after each write.  I am not sure it's threadsafe:

WriterThread:
i = idxHandle.get(this);               // plain read is fine: only the writer updates idx
buf[(int) (i % SIZE)] = val;           // wrap the slot, but keep i itself monotonic
idxHandle.setRelease(this, i + 1);

ReaderThread:
T[] copyBuf = (T[]) new Object[SIZE];  // generic arrays can't be allocated directly
start = idxHandle.getVolatile(this);
System.arraycopy(buf, 0, copyBuf, 0, SIZE);
end = idxHandle.getVolatile(this);


I know this is racy, but I *think* it may be okay.   As long as the reader ignores the elements between start and end (and assuming end - start < SIZE), it seems like the other elements copied from buf are safely published.
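(For reference: idxHandle above would be a VarHandle over a plain long field. A sketch of the setup, with the class and field names assumed:)

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class RingLog<T> {
    static final int SIZE = 16;

    static final VarHandle IDX;
    static {
        try {
            IDX = MethodHandles.lookup()
                    .findVarHandle(RingLog.class, "idx", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    long idx;                                  // stored only by the writer thread
    final Object[] buf = new Object[SIZE];     // element type erased; cast on read
}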


On Thu, Apr 25, 2019 at 2:21 PM Josh Humphries <[hidden email]> wrote:

If you wanted to further reduce volatile writes, you could make a sharded ring buffer: break the buffer up into N buffers. The writer round-robins across shards. Then you could put the stamp on the shard, not on individual entries. This would allow the writer to potentially batch up writes by dirtying the stamp on a shard, recording multiple entries (which do not need to use volatile writes), and then setting the new stamp value. This means longer critical sections, though, so more wasted busy-wait cycles in readers potentially. (Introducing a signaling mechanism seems like it would greatly complicate the scheme and may require introduction of other synchronizers, which seems to defeat the point.)
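A rough sketch of that sharded layout (illustrative names; a simplified per-shard seqlock, still assuming a single writer):

// One stamp per shard instead of per entry, so the writer can batch
// several plain writes under a single dirty/clean stamp pair.
final class ShardedRing {
    static final int SHARDS = 4, SHARD_SIZE = 256;

    static final class Shard {
        volatile long stamp;                         // odd = writer mid-batch
        final long[] seq = new long[SHARD_SIZE];     // monotonic id per row
        final String[] msg = new String[SHARD_SIZE];
        int caret;                                   // writer-only
    }

    private final Shard[] shards = new Shard[SHARDS];
    { for (int s = 0; s < SHARDS; s++) shards[s] = new Shard(); }
    private int rr;        // writer-only round-robin cursor
    private long nextSeq;  // writer-only monotonic counter

    // Writer: record a batch into one shard under one dirty/clean pair.
    void logBatch(String[] batch) {
        Shard sh = shards[rr];
        rr = (rr + 1) % SHARDS;                      // round-robin across shards
        sh.stamp = sh.stamp + 1;                     // dirty (odd)
        for (String m : batch) {
            sh.seq[sh.caret] = nextSeq++;            // plain writes under the stamp
            sh.msg[sh.caret] = m;
            sh.caret = (sh.caret + 1) % SHARD_SIZE;
        }
        sh.stamp = sh.stamp + 1;                     // clean (even), releases the batch
    }

    // Reader: seqlock-style scan of one shard; retry while the writer is in it.
    String[] readShard(int s) {
        Shard sh = shards[s];
        String[] copy = new String[SHARD_SIZE];
        while (true) {
            long before = sh.stamp;
            if ((before & 1L) == 0) {
                System.arraycopy(sh.msg, 0, copy, 0, SHARD_SIZE);
                if (sh.stamp == before) return copy; // stable: no batch in between
            }
            Thread.onSpinWait();                     // writer mid-batch; retry
        }
    }
}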



For both of the above cases, after the reader scans the entire buffer, it would need to re-order the results based on something (like a monotonic counter/ID on each row, recorded by the writer) to reconstruct the correct order. There is the possibility, of course, that the reader misses some items or even has "holes" in the sequence of entries it reads.
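The re-ordering itself can be a plain sort on that counter (a sketch; the Row shape and seq field are assumed, matching the per-row id idea above):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ReadMerge {
    record Row(long seq, String msg) {}

    static List<Row> inWriteOrder(List<Row> collected) {
        List<Row> rows = new ArrayList<>(collected);
        rows.sort(Comparator.comparingLong(Row::seq));  // restore write order
        // Gaps in consecutive seq values are entries the reader missed.
        return rows;
    }
}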

----
Josh Humphries
[hidden email]



_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest