Golang: Best way to read from a hashmap w/ mutex - hashmap

This is a continuation from here: Golang: Shared communication in async http server
Assuming I have a hashmap w/ locking:
//create async hashmap for inter request communication
type state struct {
*sync.Mutex // inherits locking methods
AsyncResponses map[string]string // map ids to values
}
var State = &state{&sync.Mutex{}, map[string]string{}}
Functions that write to this will place a lock. My question is, what is the best / fastest way to have another function check for a value without blocking writes to the hashmap? I'd like to know the instant a value is present on it.
MyVal = State.AsyncResponses[MyId]

Reading a shared map without blocking writers is the very definition of a data race. Actually, semantically it is a data race even when the writers will be blocked during the read! Because as soon as you finish reading the value and unblock the writers - the value may not exists in the map anymore.
Anyway, it's not very likely that proper syncing would be a bottleneck in many programs. A non-blocking lock af a {RW,}Mutex is probably in the order of < 20 nsecs even on middle powered CPUS. I suggest to postpone optimization not only after making the program correct, but also after measuring where the major part of time is being spent.

Related

How/when to release memory in wait-free algorithms

I'm having trouble figuring out a key point in wait-free algorithm design. Suppose a data structure has a pointer to another data structure (e.g. linked list, tree, etc), how can the right time for releasing a data structure?
The problem is this, there are separate operations that can't be executed atomically without a lock. For example one thread reads the pointer to some memory, and increments the use count for that memory to prevent free while this thread is using the data, which might take long, and even if it doesn't, it's a race condition. What prevents another thread from reading the pointer, decrementing the use count and determining that it's no longer used and freeing it before the first thread incremented the use count?
The main issue is that current CPUs only have a single word CAS (compare & swap). Alternatively the problem is that I'm clueless about waitfree algorithms and data structures and after reading some papers I'm still not seeing the light.
IMHO Garbage collection can't be the answer, because it would either GC would have to be prevented from running if any single thread is inside an atomic block (which would mean it can't be guaranteed that the GC will ever run again) or the problem is simply pushed to the GC, in which case, please explain how the GC would figure out if the data is in the silly state (a pointer is read [e.g. stored in a local variable] but the the use count didn't increment yet).
PS, references to advanced tutorials on wait-free algorithms for morons are welcome.
Edit: You should assume that the problem is being solved in a non-managed language, like C or C++. After all if it were Java, we'd have no need to worry about releasing memory. Further assume that the compiler may generate code that will store temporary references to objects in registers (invisible to other threads) right before the usage counter increment, and that a thread can be interrupted between loading the object address and incrementing the counter. This of course doesn't mean that the solution must be limited to C or C++, rather that the solution should give a set of primitives that allowing the implementation of wait-free algorithms on linked data structures. I'm interested in the primitives and how they solve the problem of designing wait-free algorithms. With such primitives a wait-free algorithm can be implemented equally well in C++ and Java.
After some research I learned this.
The problem is not trivial to solve and there are several solutions each with advantages and disadvantages. The reason for the complexity comes from inter CPU synchronization issues. If not done right it might appear to work correctly 99.9% of the time, which isn't enough, or it might fail under load.
Three solutions that I found are 1) hazard pointers, 2) quiescence period based reclamation (used by the Linux kernel in the RCU implementation) 3) reference counting techniques. 4) Other 5) Combinations
Hazard pointers work by saving the currently active references in a well-known per thread location, so any thread deciding to free memory (when the counter appears to be zero) can check if the memory is still in use by anyone. An interesting improvement is to buffer request to release memory in a small array and free them up in a batch when the array is full. The advantage of using hazard pointers is that it can actually guarantee an upper bound on unreclaimed memory. The disadvantage is that it places extra burden on the reader.
Quiescence period based reclamation works by delaying the actual release of the memory until it's known that each thread has had a chance to finish working on any data that may need to be released. The way to know that this condition is satisfied is to check if each thread passed through a quiescent period (not in a critical section) after the object was removed. In the Linux kernel this means something like each task making a voluntary task switch. In a user space application it would be the end of a critical section. This can be achieved by a simple counter, each time the counter is even the thread is not in a critical section (reading shared data), each time the counter is odd the thread is inside a critical section, to move from a critical section or back all the thread needs to do is to atomically increment the number. Based on this the "garbage collector" can determine if each thread has had a chance to finish. There are several approaches, one simple one would be to queue up the requests to free memory (e.g. in a linked list or an array), each with the current generation (managed by the GC), when the GC runs it checks the state of the threads (their state counters) to see if each passed to the next generation (their counter is higher than the last time or is the same and even), any memory can be reclaimed one generation after it was freed. The advantage of this approach is that is places the least burden on the reading threads. The disadvantage is that it can't guarantee an upper bound for the memory waiting to be released (e.g. one thread spending 5 minutes in a critical section, while the data keeps changing and memory isn't released), but in practice it works out all right.
There is a number of reference counting solutions, many of them require double compare and swap, which some CPUs don't support, so can't be relied upon. The key problem remains though, taking a reference before updating the counter. I didn't find enough information to explain how this can be done simply and reliably though. So .....
There are of course a number of "Other" solutions, it's a very important topic of research with tons of papers out there. I didn't examine all of them. I only need one.
And of course the various approaches can be combined, for example hazard pointers can solve the problems of reference counting. But there's a nearly infinite number of combinations, and in some cases a spin lock might theoretically break wait-freedom, but doesn't hurt performance in practice. Somewhat like another tidbit I found in my research, it's theoretically not possible to implement wait-free algorithms using compare-and-swap, that's because in theory (purely in theory) a CAS based update might keep failing for non-deterministic excessive times (imagine a million threads on a million cores each trying to increment and decrement the same counter using CAS). In reality however it rarely fails more than a few times (I suspect it's because the CPUs spend more clocks away from CAS than there are CPUs, but I think if the algorithm returned to the same CAS on the same location every 50 clocks and there were 64 cores there could be a chance of a major problem, then again, who knows, I don't have a hundred core machine to try this). Another results of my research is that designing and implementing wait-free algorithms and data-structures is VERY challenging (even if some of the heavy lifting is outsourced, e.g. to a garbage collector [e.g. Java]), and might perform less well than a similar algorithm with carefully placed locks.
So, yeah, it's possible to free memory even without delays. It's just tricky. And if you forget to make the right operations atomic, or to place the right memory barrier, oh, well, you're toast. :-) Thanks everyone for participating.
I think atomic operations for increment/decrement and compare-and-swap would solve this problem.
Idea:
All resources have a counter which is modified with atomic operations. The counter is initially zero.
Before using a resource: "Acquire" it by atomically incrementing its counter. The resource can be used if and only if the incremented value is greater than zero.
After using a resource: "Release" it by atomically decrementing its counter. The resource should be disposed/freed if and only if the decremented value is equal to zero.
Before disposing: Atomically compare-and-swap the counter value with the minimum (negative) value. Dispose will not happen if a concurrent thread "Acquired" the resource in between.
You haven't specified a language for your question. Here goes an example in c#:
class MyResource
{
// Counter is initially zero. Resource will not be disposed until it has
// been acquired and released.
private int _counter;
public bool Acquire()
{
// Atomically increment counter.
int c = Interlocked.Increment(ref _counter);
// Resource is available if the resulting value is greater than zero.
return c > 0;
}
public bool Release()
{
// Atomically decrement counter.
int c = Interlocked.Decrement(ref _counter);
// We should never reach a negative value
Debug.Assert(c >= 0, "Resource was released without being acquired");
// Dispose when we reach zero
if (c == 0)
{
// Mark as disposed by setting counter its minimum value.
// Only do this if the counter remain at zero. Atomic compare-and-swap operation.
if (Interlocked.CompareExchange(ref _counter, int.MinValue, c) == c)
{
// TODO: Run dispose code (free stuff)
return true; // tell caller that resource is disposed
}
}
return false; // released but still in use
}
}
Usage:
// "r" is an instance of MyResource
bool acquired = false;
try
{
if (acquired = r.Acquire())
{
// TODO: Use resource
}
}
finally
{
if (acquired)
{
if (r.Release())
{
// Resource was disposed.
// TODO: Nullify variable or similar to let GC collect it.
}
}
}
I know this is not the best way but it works for me:
for shared dynamic data-structure lists I use usage counter per item
for example:
struct _data
{
DWORD usage;
bool delete;
// here add your data
_data() { usage=0; deleted=true; }
};
const int MAX = 1024;
_data data[MAX];
now when item is started to be used somwhere then
// start use of data[i]
data[i].cnt++;
after is no longer used then
// stop use of data[i]
data[i].cnt--;
if you want to add new item to list then
// add item
for (i=0;i<MAX;i++) // find first deleted item
if (data[i].deleted)
{
data[i].deleted=false;
data[i].cnt=0;
// copy/set your data
break;
}
and now in the background once in a while (on timer or whatever)
scann data[] an all undeleted items with cnt == 0 set as deleted (+ free its dynamic memory if it has any)
[Note]
to avoid multi-thread access problems implement single global lock per data list
and program it so you cannot scann data while any data[i].cnt is changing
one bool and one DWORD suffice for this if you do not want to use OS locks
// globals
bool data_cnt_locked=false;
DWORD data_cnt=0;
now any change of data[i].cnt modify like this:
// start use of data[i]
while (data_cnt_locked) Sleep(1);
data_cnt++;
data[i].cnt++;
data_cnt--;
and modify delete scan like this
while (data_cnt) Sleep(1);
data_cnt_locked=true;
Sleep(1);
if (data_cnt==0) // just to be sure
for (i=0;i<MAX;i++) // here scan for items to delete ...
if (!data[i].cnt)
if (!data[i].deleted)
{
data[i].deleted=true;
data[i].cnt=0;
// release your dynamic data ...
}
data_cnt_locked=false;
PS.
do not forget to play with the sleep times a little to suite your needs
lock free algorithm sleep times are sometimes dependent on OS task/scheduler
this is not really an lock free implementation
because while GC is at work then all is locked
but if ather than that multi access is not blocking to each other
so if you do not run GC too often you are fine

How are "nonblocking" data structures possible?

I'm having trouble understanding how any data structure can be "nonblocking".
Say you're making a "nonblocking" hashtable. At some point or another, your hashtable gets too full, so you have to re-hash into a larger table.
This implies you need to allocate memory, which is a global resource. So it seems that you must obtain some sort of lock to prevent global corruption of the heap... irrespective of possible problems with your data structure itself!
But then that means every other thread must block while you allocate your memory...
What am I missing here?
(How) can you allocate memory without blocking another thread which is doing the same?
Two examples for non blocking designs are optimistic design and Transactional Memory.
The idea of this is - in most of the cases, the blocking is redundant - since two OPs can concurrently occur without interrupting each other. However, sometimes when 2 OPs occur concurrently and the data becomes corrupted because of it - you can roll back to your previous state, and retry.
There might still be locks in these designs, but the time the data is locked is significantly shorter, and is limited only to the critical time where the affect of the OP is taking place.
Just for some definitions, additional information and to distinguish between non-blocking, lock-free and wait-free terms, I recommend reading the following article (I won't copy the relevant passages here as it's too long):
Definitions of Non-blocking, Lock-free and Wait-free
Most strategies have one fundamental pattern in common. They use a compare and swap (CAS) operation in a loop until it succeeds.
For example, lets consider a stack implemented with a linked list. I chose a linked list implementation because it is easy to make concurrent with a CAS, but there are other ways to do it. I will use C-like pseudocode.
Push(T item)
{
Node node = new Node(); // allocate node memory
Node initial;
do
{
initial = head;
node.Value = item;
node.Next = initial;
}
while (CompareAndSwap(head, node, initial) != initial);
}
Pop()
{
Node node;
Node initial;
do
{
initial = head;
node = initial.Next;
}
while (CompareAndSwap(head, node, initial) != initial);
T value = initial.Value;
delete initial; // deallocate node memory
return value;
}
In the above code CompareAndSwap is a non-blocking atomic operation that replaces the value in a memory address with a new value and returns the old value. If the old value does not match the expected value then you spin through the loop and try it all again.
All that non-blocking means is that you never wait indefinitely, not that you never wait at all. As long as your heap is also implemented using a non-blocking algorithm, you can implement other non-blocking algorithms on top of it.

Thread safety for arrays in D?

Please bear with me on this as I'm new to this.
I have an array and two threads.
First thread appends new elements to the array when required
myArray ~= newArray;
Second thread removes elements from the array when required:
extractedArray = myArray[0..10];
myArray = myArray[10..myArray.length()];
Is this thread safe?
What happens when the two threads interact on the array at the exact same time?
No, it is not thread-safe. If you share data across threads, then you need to deal with making it thread-safe yourself via facilities such as synchronized statements, synchronized functions, core.atomic, and mutexes.
However, the other major thing that needs to be pointed out is that all data in D is thread-local by default. So, you can't access data across threads unless it's explicitly shared. So, you don't normally have to worry about thread safety at all. It's only when you explicitly share data that it's an issue.
this is not thread safe
this has the classic lost update race:
appending means examening the array to see if it can expand in-place, if not it needs to make a (O(n) time) copy while the copy is busy the other thread can slice of a piece and when the copy is done that piece will return
you should look into using a linked list implementation which are easier to make thread safe
Java's ConcurrentLinkedQueue uses the list described here for it's implementation and you can implement it with the core.atomic.cas() in the standard library
It is not thread-safe. The simplest way to fix this is to surround array operations with the synchronized block. More about it here: http://dlang.org/statement.html#SynchronizedStatement

Clojure mutable storage types

I'm attempting to learn Clojure from the API and documentation available on the site. I'm a bit unclear about mutable storage in Clojure and I want to make sure my understanding is correct. Please let me know if there are any ideas that I've gotten wrong.
Edit: I'm updating this as I receive comments on its correctness.
Disclaimer: All of this information is informal and potentially wrong. Do not use this post for gaining an understanding of how Clojure works.
Vars always contain a root binding and possibly a per-thread binding. They are comparable to regular variables in imperative languages and are not suited for sharing information between threads. (thanks Arthur Ulfeldt)
Refs are locations shared between threads that support atomic transactions that can change the state of any number of refs in a single transaction. Transactions are committed upon exiting sync expressions (dosync) and conflicts are resolved automatically with STM magic (rollbacks, queues, waits, etc.)
Agents are locations that enable information to be asynchronously shared between threads with minimal overhead by dispatching independent action functions to change the agent's state. Agents are returned immediately and are therefore non-blocking, although an agent's value isn't set until a dispatched function has completed.
Atoms are locations that can be synchronously shared between threads. They support safe manipulation between different threads.
Here's my friendly summary based on when to use these structures:
Vars are like regular old variables in imperative languages. (avoid when possible)
Atoms are like Vars but with thread-sharing safety that allows for immediate reading and safe setting. (thanks Martin)
An Agent is like an Atom but rather than blocking it spawns a new thread to calculate its value, only blocks if in the middle of changing a value, and can let other threads know that it's finished assigning.
Refs are shared locations that lock themselves in transactions. Instead of making the programmer decide what happens during race conditions for every piece of locked code, we just start up a transaction and let Clojure handle all the lock conditions between the refs in that transaction.
Also, a related concept is the function future. To me, it seems like a future object can be described as a synchronous Agent where the value can't be accessed at all until the calculation is completed. It can also be described as a non-blocking Atom. Are these accurate conceptions of future?
It sounds like you are really getting Clojure! good job :)
Vars have a "root binding" visible in all threads and each individual thread can change the value it sees with out affecting the other threads. If my understanding is correct a var cannot exist in just one thread with out a root binding that is visible to all and it cant be "rebound" until it has been defined with (def ... ) the first time.
Refs are committed at the end of the (dosync ... ) transaction that encloses the changes but only when the transaction was able to finish in a consistent state.
I think your conclusion about Atoms is wrong:
Atoms are like Vars but with thread-sharing safety that blocks until the value has changed
Atoms are changed with swap! or low-level with compare-and-set!. This never blocks anything. swap! works like a transaction with just one ref:
the old value is taken from the atom and stored thread-local
the function is applied to the old value to generate a new value
if this succeeds compare-and-set is called with old and new value; only if the value of the atom has not been changed by any other thread (still equals old value), the new value is written, otherwise the operation restarts at (1) until is succeeds eventually.
I've found two issues with your question.
You say:
If an agent is accessed while an action is occurring then the value isn't returned until the action has finished
http://clojure.org/agents says:
the state of an Agent is always immediately available for reading by any thread
I.e. you never have to wait to get the value of an agent (I assume the value changed by an action is proxied and changed atomically).
The code for the deref-method of an Agent looks like this (SVN revision 1382):
public Object deref() throws Exception{
if(errors != null)
{
throw new Exception("Agent has errors", (Exception) RT.first(errors));
}
return state;
}
No blocking is involved.
Also, I don't understand what you mean (in your Ref section) by
Transactions are committed on calls to deref
Transactions are committed when all actions of the dosync block have been completed, no exceptions have been thrown and nothing has caused the transaction to be retried. I think deref has nothing to do with it, but maybe I misunderstand your point.
Martin is right when he say that Atoms operation restarts at 1. until is succeeds eventually.
It is also called spin waiting.
While it is note really blocking on a lock the thread that did the operation is blocked until the operation succeeds so it is a blocking operation and not an asynchronously operation.
Also about Futures, Clojure 1.1 has added abstractions for promises and futures.
A promise is a synchronization construct that can be used to deliver a value from one thread to another. Until the value has been delivered, any attempt to dereference the promise will block.
(def a-promise (promise))
(deliver a-promise :fred)
Futures represent asynchronous computations. They are a way to get code to run in another thread, and obtain the result.
(def f (future (some-sexp)))
(deref f) ; blocks the thread that derefs f until value is available
Vars don't always have a root binding. It's legal to create a var without a binding using
(def x)
or
(declare x)
Attempting to evaluate x before it has a value will result in
Var user/x is unbound.
[Thrown class java.lang.IllegalStateException]

Design Pattern for multithreaded observers

In a digital signal acquisition system, often data is pushed into an observer in the system by one thread.
example from Wikipedia/Observer_pattern:
foreach (IObserver observer in observers)
observer.Update(message);
When e.g. a user action from e.g. a GUI-thread requires the data to stop flowing, you want to break the subject-observer connection, and even dispose of the observer alltogether.
One may argue: you should just stop the data source, and wait for a sentinel value to dispose of the connection. But that would incur more latency in the system.
Of course, if the data pumping thread has just asked for the address of the observer, it might find it's sending a message to a destroyed object.
Has someone created an 'official' Design Pattern countering this situation? Shouldn't they?
If you want to have the data source to always be on the safe side of concurrency, you should have at least one pointer that is always safe for him to use.
So the Observer object should have a lifetime that isn't ended before that of the data source.
This can be done by only adding Observers, but never removing them.
You could have each observer not do the core implementation itself, but have it delegate this task to an ObserverImpl object.
You lock access to this impl object. This is no big deal, it just means the GUI unsubscriber would be blocked for a little while in case the observer is busy using the ObserverImpl object. If GUI responsiveness would be an issue, you can use some kind of concurrent job-queue mechanism with an unsubscription job pushed onto it. ( like PostMessage in Windows )
When unsubscribing, you just substitute the core implementation for a dummy implementation. Again this operation should grab the lock. This would indeed introduce some waiting for the data source, but since it's just a [ lock - pointer swap - unlock ] you could say that this is fast enough for real-time applications.
If you want to avoid stacking Observer objects that just contain a dummy, you have to do some kind of bookkeeping, but this could boil down to something trivial like an object holding a pointer to the Observer object he needs from the list.
Optimization :
If you also keep the implementations ( the real one + the dummy ) alive as long as the Observer itself, you can do this without an actual lock, and use something like InterlockedExchangePointer to swap the pointers.
Worst case scenario : delegating call is going on while pointer is swapped --> no big deal all objects stay alive and delegating can continue. Next delegating call will be to new implementation object. ( Barring any new swaps of course )
You could send a message to all observers informing them the data source is terminating and let the observers remove themselves from the list.
In response to the comment, the implementation of the subject-observer pattern should allow for dynamic addition / removal of observers. In C#, the event system is a subject/observer pattern where observers are added using event += observer and removed using event -= observer.

Resources