Effects of swapping buffers on concurrent access - multithreading

Consider an application with two threads, Producer and Consumer.
Both threads are running approximately equally frequent, multiple times in a second.
Both threads access the same memory region, where Producer writes to the memory, and Consumer reads the current chunk of data and does something with it, without invalidating the data.
A classical approach is this one:
int[] sharedData;
//Called frequently by thread Producer
void WriteValues(int[] data)
{
lock(sharedData)
{
Array.Copy(data, sharedData, LENGTH);
}
}
//Called frequently by thread Consumer
void WriteValues()
{
int[] data;
lock(sharedData)
{
Array.Copy(sharedData, data, LENGTH);
}
DoSomething(data);
}
If we assume that the Array.Copy takes time, this code would run slow, since Producer always has to wait for Consumer during copying and vice versa.
An approach to this problem would be to create two buffers, one which is accessed by the Consumer, and one which is written to by the Producer, and swap the buffers, as soon as writing has finished.
int[] frontBuffer;
int[] backBuffer;
//Called frequently by thread Producer
void WriteValues(int[] data)
{
lock(backBuffer)
{
Array.Copy(data, backBuffer, LENGTH);
int[] temp = frontBuffer;
frontBuffer = backBuffer;
backBuffer = temp;
}
}
//Called frequently by thread Consumer
void WriteValues()
{
int[] data;
int[] currentFrontBuffer = frontBuffer;
lock(currentForntBuffer)
{
Array.Copy(currentFrontBuffer , data, LENGTH);
}
DoSomething(currentForntBuffer );
}
Now, my questions:
Is locking, as shown in the 2nd example, safe? Or does the change of references introduce problems?
Will the code in the 2nd example execute faster than the code in the 1st example?
Are there any better methods to efficiently solve the problem described above?
Could there be a way to solve this problem without locks? (Even if I think it is impossible)
Note: this is no classical producer/consumer problem: It is possible for Consumer to read the values multiple times before Producer writes it again - the old data stays valid until Producer writes new data.

Is locking, as shown in the 2nd example, safe? Or does the change of references introduce problems?
As far as I can tell, because reference assignment is atomic, this may be safe but not ideal. Because the WriteValues() method reads from frontBuffer without a lock or memory barrier forcing a cache refresh, there no guarantee that the variable will ever be updated with new values from main memory. There is then a potential to continuously read the stale, cached values of that instance from the local register or CPU cache. I'm unsure of whether the compiler/JIT might infer a cache refresh anyway based on the local variable, maybe somebody with more specific knowledge can speak to this area.
Even if the values aren't stale, you may also run into more contention than you would like. For example...
Thread A calls WriteValues()
Thread A takes a lock on the instance in frontBuffer and starts copying.
Thread B calls WriteValues(int[])
Thread B writes its data, moves the currently locked frontBuffer instance into backBuffer.
Thread B calls WriteValues(int[])
Thread B waits on the lock for backBuffer because Thread A still has it.
Will the code in the 2nd example execute faster than the code in the 1st example?
I suggest that you profile it and find out. X being faster than Y only matters if Y is too slow for your particular needs, and you are the only one who knows what those are.
Are there any better methods to efficiently solve the problem described above?
Yes. If you are using .Net 4 and above, there is a BlockingCollection type in System.Collections.Concurrent that models the Producer/Consumer pattern well. If you consistently read more than you write, or have multiple readers to very few writers, you may also want to consider the ReaderWriterLockSlim class. As a general rule of thumb, you should do as little within a lock as you can, which will also help to alleviate your time issue.
Could there be a way to solve this problem without locks? (Even if I think it is impossible)
You might be able to, but I wouldn't suggest trying that unless you are extremely familiar with multi-threading, cache coherency, and potential compiler/JIT optimizations. Locking will most likely be fine for your situation and it will be much easier for you (and others reading your code) to reason about and maintain.

Related

How/when to release memory in wait-free algorithms

I'm having trouble figuring out a key point in wait-free algorithm design. Suppose a data structure has a pointer to another data structure (e.g. linked list, tree, etc), how can the right time for releasing a data structure?
The problem is this, there are separate operations that can't be executed atomically without a lock. For example one thread reads the pointer to some memory, and increments the use count for that memory to prevent free while this thread is using the data, which might take long, and even if it doesn't, it's a race condition. What prevents another thread from reading the pointer, decrementing the use count and determining that it's no longer used and freeing it before the first thread incremented the use count?
The main issue is that current CPUs only have a single word CAS (compare & swap). Alternatively the problem is that I'm clueless about waitfree algorithms and data structures and after reading some papers I'm still not seeing the light.
IMHO Garbage collection can't be the answer, because it would either GC would have to be prevented from running if any single thread is inside an atomic block (which would mean it can't be guaranteed that the GC will ever run again) or the problem is simply pushed to the GC, in which case, please explain how the GC would figure out if the data is in the silly state (a pointer is read [e.g. stored in a local variable] but the the use count didn't increment yet).
PS, references to advanced tutorials on wait-free algorithms for morons are welcome.
Edit: You should assume that the problem is being solved in a non-managed language, like C or C++. After all if it were Java, we'd have no need to worry about releasing memory. Further assume that the compiler may generate code that will store temporary references to objects in registers (invisible to other threads) right before the usage counter increment, and that a thread can be interrupted between loading the object address and incrementing the counter. This of course doesn't mean that the solution must be limited to C or C++, rather that the solution should give a set of primitives that allowing the implementation of wait-free algorithms on linked data structures. I'm interested in the primitives and how they solve the problem of designing wait-free algorithms. With such primitives a wait-free algorithm can be implemented equally well in C++ and Java.
After some research I learned this.
The problem is not trivial to solve and there are several solutions each with advantages and disadvantages. The reason for the complexity comes from inter CPU synchronization issues. If not done right it might appear to work correctly 99.9% of the time, which isn't enough, or it might fail under load.
Three solutions that I found are 1) hazard pointers, 2) quiescence period based reclamation (used by the Linux kernel in the RCU implementation) 3) reference counting techniques. 4) Other 5) Combinations
Hazard pointers work by saving the currently active references in a well-known per thread location, so any thread deciding to free memory (when the counter appears to be zero) can check if the memory is still in use by anyone. An interesting improvement is to buffer request to release memory in a small array and free them up in a batch when the array is full. The advantage of using hazard pointers is that it can actually guarantee an upper bound on unreclaimed memory. The disadvantage is that it places extra burden on the reader.
Quiescence period based reclamation works by delaying the actual release of the memory until it's known that each thread has had a chance to finish working on any data that may need to be released. The way to know that this condition is satisfied is to check if each thread passed through a quiescent period (not in a critical section) after the object was removed. In the Linux kernel this means something like each task making a voluntary task switch. In a user space application it would be the end of a critical section. This can be achieved by a simple counter, each time the counter is even the thread is not in a critical section (reading shared data), each time the counter is odd the thread is inside a critical section, to move from a critical section or back all the thread needs to do is to atomically increment the number. Based on this the "garbage collector" can determine if each thread has had a chance to finish. There are several approaches, one simple one would be to queue up the requests to free memory (e.g. in a linked list or an array), each with the current generation (managed by the GC), when the GC runs it checks the state of the threads (their state counters) to see if each passed to the next generation (their counter is higher than the last time or is the same and even), any memory can be reclaimed one generation after it was freed. The advantage of this approach is that is places the least burden on the reading threads. The disadvantage is that it can't guarantee an upper bound for the memory waiting to be released (e.g. one thread spending 5 minutes in a critical section, while the data keeps changing and memory isn't released), but in practice it works out all right.
There is a number of reference counting solutions, many of them require double compare and swap, which some CPUs don't support, so can't be relied upon. The key problem remains though, taking a reference before updating the counter. I didn't find enough information to explain how this can be done simply and reliably though. So .....
There are of course a number of "Other" solutions, it's a very important topic of research with tons of papers out there. I didn't examine all of them. I only need one.
And of course the various approaches can be combined, for example hazard pointers can solve the problems of reference counting. But there's a nearly infinite number of combinations, and in some cases a spin lock might theoretically break wait-freedom, but doesn't hurt performance in practice. Somewhat like another tidbit I found in my research, it's theoretically not possible to implement wait-free algorithms using compare-and-swap, that's because in theory (purely in theory) a CAS based update might keep failing for non-deterministic excessive times (imagine a million threads on a million cores each trying to increment and decrement the same counter using CAS). In reality however it rarely fails more than a few times (I suspect it's because the CPUs spend more clocks away from CAS than there are CPUs, but I think if the algorithm returned to the same CAS on the same location every 50 clocks and there were 64 cores there could be a chance of a major problem, then again, who knows, I don't have a hundred core machine to try this). Another results of my research is that designing and implementing wait-free algorithms and data-structures is VERY challenging (even if some of the heavy lifting is outsourced, e.g. to a garbage collector [e.g. Java]), and might perform less well than a similar algorithm with carefully placed locks.
So, yeah, it's possible to free memory even without delays. It's just tricky. And if you forget to make the right operations atomic, or to place the right memory barrier, oh, well, you're toast. :-) Thanks everyone for participating.
I think atomic operations for increment/decrement and compare-and-swap would solve this problem.
Idea:
All resources have a counter which is modified with atomic operations. The counter is initially zero.
Before using a resource: "Acquire" it by atomically incrementing its counter. The resource can be used if and only if the incremented value is greater than zero.
After using a resource: "Release" it by atomically decrementing its counter. The resource should be disposed/freed if and only if the decremented value is equal to zero.
Before disposing: Atomically compare-and-swap the counter value with the minimum (negative) value. Dispose will not happen if a concurrent thread "Acquired" the resource in between.
You haven't specified a language for your question. Here goes an example in c#:
class MyResource
{
// Counter is initially zero. Resource will not be disposed until it has
// been acquired and released.
private int _counter;
public bool Acquire()
{
// Atomically increment counter.
int c = Interlocked.Increment(ref _counter);
// Resource is available if the resulting value is greater than zero.
return c > 0;
}
public bool Release()
{
// Atomically decrement counter.
int c = Interlocked.Decrement(ref _counter);
// We should never reach a negative value
Debug.Assert(c >= 0, "Resource was released without being acquired");
// Dispose when we reach zero
if (c == 0)
{
// Mark as disposed by setting counter its minimum value.
// Only do this if the counter remain at zero. Atomic compare-and-swap operation.
if (Interlocked.CompareExchange(ref _counter, int.MinValue, c) == c)
{
// TODO: Run dispose code (free stuff)
return true; // tell caller that resource is disposed
}
}
return false; // released but still in use
}
}
Usage:
// "r" is an instance of MyResource
bool acquired = false;
try
{
if (acquired = r.Acquire())
{
// TODO: Use resource
}
}
finally
{
if (acquired)
{
if (r.Release())
{
// Resource was disposed.
// TODO: Nullify variable or similar to let GC collect it.
}
}
}
I know this is not the best way but it works for me:
for shared dynamic data-structure lists I use usage counter per item
for example:
struct _data
{
DWORD usage;
bool delete;
// here add your data
_data() { usage=0; deleted=true; }
};
const int MAX = 1024;
_data data[MAX];
now when item is started to be used somwhere then
// start use of data[i]
data[i].cnt++;
after is no longer used then
// stop use of data[i]
data[i].cnt--;
if you want to add new item to list then
// add item
for (i=0;i<MAX;i++) // find first deleted item
if (data[i].deleted)
{
data[i].deleted=false;
data[i].cnt=0;
// copy/set your data
break;
}
and now in the background once in a while (on timer or whatever)
scann data[] an all undeleted items with cnt == 0 set as deleted (+ free its dynamic memory if it has any)
[Note]
to avoid multi-thread access problems implement single global lock per data list
and program it so you cannot scann data while any data[i].cnt is changing
one bool and one DWORD suffice for this if you do not want to use OS locks
// globals
bool data_cnt_locked=false;
DWORD data_cnt=0;
now any change of data[i].cnt modify like this:
// start use of data[i]
while (data_cnt_locked) Sleep(1);
data_cnt++;
data[i].cnt++;
data_cnt--;
and modify delete scan like this
while (data_cnt) Sleep(1);
data_cnt_locked=true;
Sleep(1);
if (data_cnt==0) // just to be sure
for (i=0;i<MAX;i++) // here scan for items to delete ...
if (!data[i].cnt)
if (!data[i].deleted)
{
data[i].deleted=true;
data[i].cnt=0;
// release your dynamic data ...
}
data_cnt_locked=false;
PS.
do not forget to play with the sleep times a little to suite your needs
lock free algorithm sleep times are sometimes dependent on OS task/scheduler
this is not really an lock free implementation
because while GC is at work then all is locked
but if ather than that multi access is not blocking to each other
so if you do not run GC too often you are fine

<Spring Batch> Why does making ItemReader thread-safe leads us to loosing restartability?

I have a multi-threaded batch job reading from a DB and I am concerned about different threads re-reading records as ItemReader is not thread safe in Spring batch. I went through SpringBatch FAQ section which states that
You can synchronize the read() method (e.g. by wrapping it in a delegator that does the synchronization). Remember that you will lose restartability, so best practice is to mark the step as not restartable and to be safe (and efficient) you can also set saveState=false on the reader.
I want to know why will I loose re-startability in this case? What has restartability got to do with synchronizing my read operations? It can always try again,right?
Also, will this piece of code be enough for synchronizing the reader?
public SynchronizedItemReader<T> implements ItemReader<T> {
private final ItemReader<T> delegate;
public SynchronizedItemReader(ItemReader<T> delegate) {
this.delegate = delegate;
}
public synchronized T read () {
return delegate.read();
}
}
When using an ItemReader with multithreads, the lack of restartability is not about the read itself. It's about saving the state of the reader which occurs in the update method. The issue is that there needs to be coordination between the calls to read() - the method providing the data and update() - the method persisting the state. When you use multiple threads, the internal state of the reader (and therefore the update() call) may or may not reflect the work that has been done. Take for example the FlatFileItemReader using a chunk size of 5 and running on multiple threads. You could have thread1 having read 5 items (time to update), yet thread 2 could have read an additional 3. This means that the call to update would save that 8 items have been read. If the chunk on thread 2 fails, the state would due incorrect and the restart would miss the three items that were already read.
This is not to say that it is impossible to write a thread safe ItemReader. However, as your example above illustrates, if delegate is a stateful ItemReader (implements ItemStream as well), the state will not be persisted correctly with calls to update (in fact, your example above doesn't even take the ItemStream aspect of stageful readers into account).
If you want make restartable your job, with parallel execution of items, you can save item, that reader read plus state of this item by yourself.

Multi-threaded application, parrarel reading with write possibility

For example, I have multi-threaded application which can be presented as:
Data bigData;
void thread1()
{
workOn(bigData);
}
void thread2()
{
workOn(bigData);
}
void thread3()
{
workOn(bigData);
}
There are few threads that are working on data. I could leave it as it is, but the problem is that sometimes (very seldom) data are modified by thread4.
void thread4()
{
sometimesModifyData(bigData);
}
Critical sections could be added there, but it would make no sense to multi-threading, because only one thread could work on data at the same time.
What is the best method to make it sense multi-threading while making it thread safe?
I am thinking about kind of state (sempahore?), that would prevent reading and writing at the same time but would allow parallel reading.
This is called a readers–writer lock. You could implement what is called a mutex to make sure no one reads when write is going on and no one writes when reads are going on. One way to solve the problem would be to have flags. If the writer is got something to modify, then switch on a lock. Upon which NO MORE readers will get to read and after all the current readers have finished, the writer will get to do its job and then again the readers read.

multithreading: how to process data in a vector, while the vector is being populated?

I have a single-threaded linux app which I would like to make parallel. It reads a data file, creates objects, and places them in a vector. Then it calls a compute-intensive method (.5 second+) on each object. I want to call the method in parallel with object creation. While I've looked at qt and tbb, I am open to other options.
I planned to start the thread(s) while the vector was empty. Each one would call makeSolids (below), which has a while loop that would run until interpDone==true and all objects in the vector have been processed. However, I'm a n00b when it comes to threading, and I've been looking for a ready-made solution.
QtConcurrent::map(Iter begin,Iter end,function()) looks very easy, but I can't use it on a vector that's changing in size, can I? And how would I tell it to wait for more data?
I also looked at intel's tbb, but it looked like my main thread would halt if I used parallel_for or parallel_while. That stinks, since their memory manager was recommended (open cascade's mmgt has poor performance when multithreaded).
/**intended to be called by a thread
\param start the first item to get from the vector
\param skip how many to skip over (4 for 4 threads)
*/
void g2m::makeSolids(uint start, uint incr) {
uint curr = start;
while ((!interpDone) || (lineVector.size() > curr)) {
if (lineVector.size() > curr) {
if (lineVector[curr]->isMotion()) {
((canonMotion*)lineVector[curr])->setSolidMode(SWEPT);
((canonMotion*)lineVector[curr])->computeSolid();
}
lineVector[curr]->setDispMode(BEST);
lineVector[curr]->display();
curr += incr;
} else {
uio::sleep(); //wait a little bit for interp
}
}
}
EDIT: To summarize, what's the simplest way to process a vector at the same time that the main thread is populating the vector?
Firstly, to benefit from threading you need to find similarly slow tasks for each thread to do. You said your per-object processing takes .5s+, how long does your file reading / object creation take? It could easily be a tenth or a thousandth of that time, in which case your multithreading approach is going to produce neglegible benefit. If that's the case, (yes, I'll answer your original question soon incase it's not) then think about simultaneously processing multiple objects. Given your processing takes quite a while, the thread creation overhead isn't terribly significant, so you could simply have your main file reading/object creation thread spawn a new thread and direct it at the newly created object. The main thread then continues reading/creating subsequent objects. Once all objects are read/created, and all the processing threads launched, the main thread "joins" (waits for) the worker threads. If this will create too many threads (thousands), then put a limit on how far ahead the main thread is allowed to get: it might read/create 10 objects then join 5, then read/create 10, join 10, read/create 10, join 10 etc. until finished.
Now, if you really want the read/create to be in parallel with the processing, but the processing to be serialised, then you can still use the above approach but join after each object. That's kind of weird if you're designing this with only this approach in mind, but good because you can easily experiment with the object processing parallelism above as well.
Alternatively, you can use a more complex approach that just involves the main thread (that the OS creates when your program starts), and a single worker thread that the main thread must start. They should be coordinated using a mutex (a variable ensuring mutually-exclusive, which means not-concurrent, access to data), and a condition variable which allows the worker thread to efficiently block until the main thread has provided more work. The terms - mutex and condition variable - are the standard terms in the POSIX threading that Linux uses, so should be used in the explanation of the particular libraries you're interested in. Summarily, the worker thread waits until the main read/create thread broadcasts it a wake-up signal indicating another object is ready for processing. You may want to have a counter with index of the last fully created, ready-for-processing object, so the worker thread can maintain it's count of processed objects and move along the ready ones before once again checking the condition variable.
It's hard to tell if you have been thinking about this problem deeply and there is more than you are letting on, or if you are just over thinking it, or if you are just wary of threading.
Reading the file and creating the objects is fast; the one method is slow. The dependency is each consecutive ctor depends on the outcome of the previous ctor - a little odd - but otherwise there are no data integrity issues so there doesn't seem to be anything that needs to be protected by mutexes and such.
Why is this more complicated than something like this (in crude pseudo-code):
while (! eof)
{
readfile;
object O(data);
push_back(O);
pthread_create(...., O, makeSolid);
}
while(x < vector.size())
{
pthread_join();
x++;
}
If you don't want to loop on the joins in your main then spawn off a thread to wait on them by passing a vector of TIDs.
If the number of created objects/threads is insane, use a thread pool. Or put a counter is the creation loop to limit the number of threads that can be created before running ones are joined.
#Caleb: quite -- perhaps I should have emphasized active threads. The GUI thread should always be considered one.

Primitive synchronization primitives -- safe?

On constrained devices, I often find myself "faking" locks between 2 threads with 2 bools. Each is only read by one thread, and only written by the other. Here's what I mean:
bool quitted = false, paused = false;
bool should_quit = false, should_pause = false;
void downloader_thread() {
quitted = false;
while(!should_quit) {
fill_buffer(bfr);
if(should_pause) {
is_paused = true;
while(should_pause) sleep(50);
is_paused = false;
}
}
quitted = true;
}
void ui_thread() {
// new Thread(downloader_thread).start();
// ...
should_pause = true;
while(!is_paused) sleep(50);
// resize buffer or something else non-thread-safe
should_pause = false;
}
Of course on a PC I wouldn't do this, but on constrained devices, it seems reading a bool value would be much quicker than obtaining a lock. Of course I trade off for slower recovery (see "sleep(50)") when a change to the buffer is needed.
The question -- is it completely thread-safe? Or are there hidden gotchas I need to be aware of when faking locks like this? Or should I not do this at all?
Using bool values to communicate between threads can work as you intend, but there are indeed two hidden gotchas as explained in this blog post by Vitaliy Liptchinsky:
Cache Coherency
A CPU does not always fetch memory values from RAM. Fast memory caches on the die are one of the tricks used by CPU designers to work around the Von Neumann bottleneck. On some multi-cpu or multi-core architectures (like Intel's Itanium) these CPU caches are not shared or automatically kept in sync. In other words, your threads may be seeing different values for the same memory address if they run on different CPU's.
To avoid this you need to declare your variables as volatile (C++, C#, java), or do explicit volatile read/writes, or make use of locking mechanisms.
Compiler Optimizations
The compiler or JITter may perform optimizations which are not safe if multiple threads are involved. See the linked blog post for an example. Again, you must make use of the volatile keyword or other mechanisms to inform you compiler.
Unless you understand the memory architecture of your device in detail, as well as the code generated by your compiler, this code is not safe.
Just because it seems that it would work, doesn't mean that it will. "Constrained" devices, like the unconstrained type, are getting more and more powerful. I wouldn't bet against finding a dual-core CPU in a cell phone, for instance. That means I wouldn't bet that the above code would work.
Concerning the sleep call, you could always just do sleep(0) or the equivalent call that pauses your thread letting the next in line a turn.
Concerning the rest, this is thread safe if you know the implementation details of your device.
Answering the questions.
Is this completely thread safe? I would answer no this is not thread safe and I would just not do this at all. Without knowing the details of our device and compiler, if this is C++, the compiler is free to reorder and optimize things away as it sees fit. e.g. you wrote:
is_paused = true;
while(should_pause) sleep(50);
is_paused = false;
but the compiler may choose to reorder this into something like this:
sleep(50);
is_paused = false;
this probably won't work even a single core device as others have said.
Rather than taking a lock, you may try to do better to just do less on the UI thread rather than yield in the middle of processing UI messages. If you think that you have spent too much time on the UI thread then find a way to cleanly exit and register an asynchronous call back.
If you call sleep on a UI thread (or try to acquire a lock or do anyting that may block) you open the door to hangs and glitchy UIs. A 50ms sleep is enough for a user to notice. And if you try to acquire a lock or do any other blocking operation (like I/O) you need to deal with the reality of waiting for an indeterminate amount of time to get the I/O which tends to translate from glitch to hang.
This code is unsafe under almost all circumstances. On multi-core processors you will not have cache coherency between cores because bool reads and writes are not atomic operations. This means each core is not guarenteed to have the same value in the cache or even from memory if the cache from the last write hasn't been flushed.
However, even on resource constrained single core devices this is not safe because you do not have control over the scheduler. Here is an example, for simplicty I'm going to pretend these are the only two threads on the device.
When the ui_thread runs, the following lines of code could be run in the same timeslice.
// new Thread(downloader_thread).start();
// ...
should_pause = true;
The downloader_thread runs next and in it's time slice the following lines are executed:
quitted = false;
while(!should_quit)
{
fill_buffer(bfr);
The scheduler prempts the downloader_thread before fill_buffer returns and then activates the ui_thread which runs.
while(!is_paused) sleep(50);
// resize buffer or something else non-thread-safe
should_pause = false;
The resize buffer operation is done while the downloader_thread is in the process of filling the buffer. This means the buffer is corrupted and you'll likely crash soon. It won't happen everytime, but the fact that you are filling the buffer before you set is_paused to true makes it more likely to happen, but even if you switched the order of those two operations on the downloader_thread you would still have a race condition, but you'd likely deadlock instead of corrupting the buffer.
Incidentally, this is a type of spinlock, it just doesn't work. Spinlock's aren't very for wait times that are likely to span to many time slices cause the spin the processor. Your implmentation does sleep which is a bit nicer but the scheduler still has to run your thread and thread context switches aren't cheap. If you are waiting on a critical section or semaphore, the scheduler doesn't active your thread again till the resource has become free.
You might be able to get away with this in some form on a specific platform/architecture, but it is really easy to make a mistake that is very hard to track down.

Resources