User defined atomic less than - multithreading

I've been reading and it seems that std::atomic doesn't support a compare and swap of the less/greater than variant.
I'm using OpenMP and need to safely update a global minimum value.
I was thinking this would be as easy as using a built-in API.
But alas, so instead I'm trying to come up with my own implementation.
I'm primarily concerned with the fact that I don't want to use an omp critical section to do a less than comparison every single time because it may incur significant synchronization overhead for very little gain in most cases.
But in those cases where a new global minima is potentially found (less often), the synchronization overhead is acceptable. I'm thinking I can implement it using the following method. Hoping for someone to advise.
Use an std::atomic_uint as the global minima.
Atomically read the value into thread local stack.
Compare it against the current value and if it's less, attempt to enter a critical section.
Once synchronized, verify that the atomic value is still less than the new one and update accordingly (the body of the critical section should be cheap, just update a few values).
This is for a homework assignment, so I'm trying to keep the implementation my own. Please don't recommend various libraries to accomplish this. But please do comment on the synchronization overhead that this operation can incur or if it's bad, elaborate on why. Thanks.

What you're looking for would be called fetch_min() if it existed: fetch old value and update the value in memory to min(current, new), exactly like fetch_add but with min().
This operation is not directly supported in hardware on x86, but machines with LL/SC could emit slightly more efficient asm for it than from emulating it with a CAS ( old, min(old,new) ) retry loop.
You can emulate any atomic operation with a CAS retry loop. In practice it usually doesn't have to retry, because the CPU that succeeded at doing a load usually also succeeds at CAS a few cycles later after computing whatever with the load result, so it's efficient.
See Atomic double floating point or SSE/AVX vector load/store on x86_64 for an example of creating a fetch_add for atomic<double> with a CAS retry loop, in terms of compare_exchange_weak and plain + for double. Do that with min and you're all set.
Re: clarification in comments: I think you're saying you have a global minimum, but when you find a new one, you want to update some associated data, too. Your question is confusing because "compare and swap on less/greater than" doesn't help you with that.
I'd recommend using atomic<unsigned> globmin to track the global minimum, so you can read it to decide whether or not to enter the critical section and update related state that goes with that minimum.
Only ever modify globmin while holding the lock (i.e. inside the critical section). Then you can update it + the associated data. It has to be atomic<> so readers that look at just globmin outside the critical section don't have data race UB. Readers that look at the associated extra data must take the lock that protects it and makes sure that updates of globmin + the extra data happen "atomically", from the perspective of readers that obey the lock.
static std::atomic<unsigned> globmin;
std::mutex globmin_lock;
static struct Extradata globmin_extra;
void new_min_candidate(unsigned newmin, const struct Extradata &newdata)
{
// light-weight early out check to avoid the critical section
// No ordering requirement as long as globmin is monotonically decreasing with time
if (newmin < globmin.load(std::memory_order_relaxed))
{
// enter a critical section. Use OpenMP stuff if you want, this is plain ISO C++
std::lock_guard<std::mutex> lock(globmin_lock);
// Check globmin again, after we've excluded other threads from modifying it and globmin_extra
if (newmin < globmin.load(std::memory_order_relaxed)) {
globmin.store(newmin, std::memory_order_relaxed);
globmin_extra = newdata;
}
// else leave the critical section with no update:
// another thread raced with use *outside* the critical section
// release the lock / leave critical section (lock goes out of scope here: RAII)
}
// else do nothing
}
std::memory_order_relaxed is sufficient for globmin: there's no ordering required with anything else, just atomicity. We get atomicity / consistency for the associated data from the critical section/lock, not from memory-ordering semantics of loading / storing globmin.
This way the only atomic read-modify-write operation is the locking itself. Everything on globmin is either load or store (much cheaper). The main cost with multiple threads will still be bouncing the cache line around, but once you own a cache line, each atomic RMW is maybe 20x more expensive than a simple store on modern x86 (http://agner.org/optimize/).
With this design, if most candidates aren't lower than globmin, the cache line will stay in the Shared state most of the time, so the globmin.load(std::memory_order_relaxed) outside the critical section can hit in L1D cache. It's just an ordinary load instruction, so it's extremely cheap. (On x86, even seq-cst loads are just ordinary loads (and release loads are just ordinary stores, but seq_cst stores are more expensive). On other architectures where the default ordering is weaker, seq_cst / acquire loads need a barrier.)

Related

Confusion about C++11 lock free stack push() function

I'm reading C++ Concurrency in Action by Anthony Williams, and don't understand its push implementation of the lock_free_stack class. Listing 7.12 to be precise
void push(T const& data)
{
counted_node_ptr new_node;
new_node.ptr=new node(data);
new_node.external_count=1;
new_node.ptr->next=head.load(std::memory_order_relaxed)
while(!head.compare_exchange_weak(new_node.ptr->next,new_node, std::memory_order_release, std::memory_order_relaxed));
}
So imagine 2 threads (A, B) calling push function. Both of them reach while loop but not start it. So they both read the same value from head.load(std::memory_order_relaxed).
Then we have the following things going on:
B thread gets swiped out for any reason
A thread starts the loop and obviously successfully adds a new node to the stack.
B thread gets back on track and also starts the loop.
And this is where it gets interesting as it seems to me.
Because there was a load operation with std::memory_order_relaxed and compare_exchange_weak(..., std::memory_order_release, ...) in case of success it looks like there is no synchronization between threads whatsoever.
I mean it's like std::memory_order_relaxed - std::memory_order_release and not std::memory_order_acquire - std::memory_order_release.
So B thread will simply add a new node to the stack but to its initial state when we had no nodes in the stack and reset head to this new node.
I was doing my research all around this subject and the best i could find was in this post Does exchange or compare_and_exchange reads last value in modification order?
So the question is, is it true? and all RMW functions see the last value in modification order? No matter what std::memory_order we used, if we use RMW operation it will synchronize with all threads (CPU and etc) and find the last value to be written to the atomic operation upon it is called?
So after some research and asking a bunch of people I believe I found the proper answer to this question, I hope it'll be a help to someone.
So the question is, is it true? and all RMW functions see the last
value in modification order?
Yes, it is true.
No matter what std::memory_order we used, if we use RMW operation it
will synchronize with all threads (CPU and etc) and find the last
value to be written to the atomic operation upon it is called?
Yes, it is also true, however there is something that needs to be highlighted.
RMW operation will synchronize only the atomic variable it works with. In our case, it is head.load
Perhaps you would like to ask why we need release - acquire semantics at all if RMW does the synchronization even with the relaxed memory order.
The answer is because RMW works only with the variable it synchronizes, but other operations which occurred before RMW might not be seen in the other thread.
lets look at the push function again:
void push(T const& data)
{
counted_node_ptr new_node;
new_node.ptr=new node(data);
new_node.external_count=1;
new_node.ptr->next=head.load(std::memory_order_relaxed)
while(!head.compare_exchange_weak(new_node.ptr->next,new_node, std::memory_order_release, std::memory_order_relaxed));
}
In this example, in case of using two push threads they won't be synchronized with each other to some extent, but it could be allowed here.
Both threads will always see the newest head because compare_exchange_weak
provides this. And a new node will be always added to the top of the stack.
However if we tried to get the value like this *(new_node.ptr->next) after this line new_node.ptr->next=head.load(std::memory_order_relaxed) things could easily turn ugly: empty pointer might be dereferenced.
This might happen because a processor can change the order of instructions and because there is no synchronization between threads the second thread could see the pointer to a top node even before that was initialized!
And this is exactly where release-acquire semantic comes to help. It ensures that all operations which happened before release operation will be seen in acquire part!
Check out and compare listings 5.5 and 5.8 in the book.
I also recommend you to read this article about how processors work, it might provide some essential information for better understanding.
memory barriers

Memory barrier in the implementation of single producer single consumer

The following implementation from Wikipedia:
volatile unsigned int produceCount = 0, consumeCount = 0;
TokenType buffer[BUFFER_SIZE];
void producer(void) {
while (1) {
while (produceCount - consumeCount == BUFFER_SIZE)
sched_yield(); // buffer is full
buffer[produceCount % BUFFER_SIZE] = produceToken();
// a memory_barrier should go here, see the explanation above
++produceCount;
}
}
void consumer(void) {
while (1) {
while (produceCount - consumeCount == 0)
sched_yield(); // buffer is empty
consumeToken(buffer[consumeCount % BUFFER_SIZE]);
// a memory_barrier should go here, the explanation above still applies
++consumeCount;
}
}
says that a memory barrier must be used between the line that accesses the buffer and the line that updates the Count variable.
This is done to prevent the CPU from reordering the instructions above the fence along-with that below it. The Count variable shouldn't be incremented before it is used to index into the buffer.
If a fence is not used, won't this kind of reordering violate the correctness of code? The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Thanks
If a fence is not used, won't this kind of reordering violate the correctness of code? The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Good question.
In c++, unless some form of memory barrier is used (atomic, mutex, etc), the compiler assumes that the code is single-threaded. In which case, the as-if rule says that the compiler may emit whatever code it likes, provided that the overall observable effect is 'as if' your code was executed sequentially.
As mentioned in the comments, volatile does not necessarily alter this, being merely an implementation-defined hint that the variable may change between accesses (this is not the same as being modified by another thread).
So if you write multi-threaded code without memory barriers, you get no guarantees that changes to a variable in one thread will even be observed by another thread, because as far as the compiler is concerned that other thread should not be touching the same memory, ever.
What you will actually observe is undefined behaviour.
It seems, that your question is "can incrementing Count and assigment to buffer be reordered without changing code behavior?".
Consider following code tansformation:
int count1 = produceCount++;
buffer[count1 % BUFFER_SIZE] = produceToken();
Notice that code behaves exactly as original one: one read from volatile variable, one write to volatile, read happens before write, state of program is the same. However, other threads will see different picture regarding order of produceCount increment and buffer modifications.
Both compiler and CPU can do that transformation without memory fences, so you need to force those two operations to be in correct order.
If a fence is not used, won't this kind of reordering violate the correctness of code?
Nope. Can you construct any portable code that can tell the difference?
The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Why shouldn't it? What would the payoff be for the costs incurred? Things like write combining and speculative fetching are huge optimizations and disabling them is a non-starter.
If you're thinking that volatile alone should do it, that's simply not true. The volatile keyword has no defined thread synchronization semantics in C or C++. It might happen to work on some platforms and it might happen not to work on others. In Java, volatile does have defined thread synchronization semantics, but they don't include providing ordering for accesses to non-volatiles.
However, memory barriers do have well-defined thread synchronization semantics. We need to make sure that no thread can see that data is available before it sees that data. And we need to make sure that a thread that marks data as able to be overwritten is not seen before the thread is finished with that data.

Extreme slowdown, OpenMP probably see unexisting race conditions?

my code on OpenMP gets very slow when I add the (*pRandomTrial)++; after generating random number. To g_iRandomTrials[32] I store number of rand() calls from each thread. Each thread writes different index of this array, there are no race conditions, results are OK, but this very easy counter makes the program almost 10 times slower then without counter. Is there some keyword I can use in this case? I tried some setups with firstprivate(g_iRandomTrials), but I was never successfull. When I create int counter in Simulate() function and use the pointer only twice on start and on the end of function, code will run probably much faster, but this seems as somewhat ugly solution, as it doesn't do anything about the problem...
int g_iRandomTrials[32];
...
#pragma omp parallel
{
do
{
...
Simulate();
...
}
}
void Simulate(void)
{
...
int id=omp_get_thread_num();
int*pRandomTrial=g_iRandomTrials+id;
...
while (used[index])
{
index=rand()%50;
(*pRandomTrial)++;
}
}
The reason for the slow down is called false sharing. The answer is padding.
In computer science, false sharing is a performance-degrading usage
pattern that can arise in systems with distributed, coherent caches at
the size of the smallest resource block managed by the caching
mechanism. When a system participant attempts to periodically access
data that will never be altered by another party, but that data shares
a cache block with data that is altered, the caching protocol may
force the first participant to reload the whole unit despite a lack of
logical necessity. The caching system is unaware of activity within
this block and forces the first participant to bear the caching system
overhead required by true shared access of a resource.
https://en.wikipedia.org/wiki/False_sharing
CPUs lock their memory in something called cachelines. These tend to be 64-bytes in length. When one core accesses a variable, it locks the entire cacheline and fetches it from memory. Other cores can no longer access it until the lock is released.
The answer is to pad and align your randomTrials in such a way that no value is within 64-bytes of another. Keep in mind that 64-byte value is most common, but there are architectures were this differs.

How/when to release memory in wait-free algorithms

I'm having trouble figuring out a key point in wait-free algorithm design. Suppose a data structure has a pointer to another data structure (e.g. linked list, tree, etc), how can the right time for releasing a data structure?
The problem is this, there are separate operations that can't be executed atomically without a lock. For example one thread reads the pointer to some memory, and increments the use count for that memory to prevent free while this thread is using the data, which might take long, and even if it doesn't, it's a race condition. What prevents another thread from reading the pointer, decrementing the use count and determining that it's no longer used and freeing it before the first thread incremented the use count?
The main issue is that current CPUs only have a single word CAS (compare & swap). Alternatively the problem is that I'm clueless about waitfree algorithms and data structures and after reading some papers I'm still not seeing the light.
IMHO Garbage collection can't be the answer, because it would either GC would have to be prevented from running if any single thread is inside an atomic block (which would mean it can't be guaranteed that the GC will ever run again) or the problem is simply pushed to the GC, in which case, please explain how the GC would figure out if the data is in the silly state (a pointer is read [e.g. stored in a local variable] but the the use count didn't increment yet).
PS, references to advanced tutorials on wait-free algorithms for morons are welcome.
Edit: You should assume that the problem is being solved in a non-managed language, like C or C++. After all if it were Java, we'd have no need to worry about releasing memory. Further assume that the compiler may generate code that will store temporary references to objects in registers (invisible to other threads) right before the usage counter increment, and that a thread can be interrupted between loading the object address and incrementing the counter. This of course doesn't mean that the solution must be limited to C or C++, rather that the solution should give a set of primitives that allowing the implementation of wait-free algorithms on linked data structures. I'm interested in the primitives and how they solve the problem of designing wait-free algorithms. With such primitives a wait-free algorithm can be implemented equally well in C++ and Java.
After some research I learned this.
The problem is not trivial to solve and there are several solutions each with advantages and disadvantages. The reason for the complexity comes from inter CPU synchronization issues. If not done right it might appear to work correctly 99.9% of the time, which isn't enough, or it might fail under load.
Three solutions that I found are 1) hazard pointers, 2) quiescence period based reclamation (used by the Linux kernel in the RCU implementation) 3) reference counting techniques. 4) Other 5) Combinations
Hazard pointers work by saving the currently active references in a well-known per thread location, so any thread deciding to free memory (when the counter appears to be zero) can check if the memory is still in use by anyone. An interesting improvement is to buffer request to release memory in a small array and free them up in a batch when the array is full. The advantage of using hazard pointers is that it can actually guarantee an upper bound on unreclaimed memory. The disadvantage is that it places extra burden on the reader.
Quiescence period based reclamation works by delaying the actual release of the memory until it's known that each thread has had a chance to finish working on any data that may need to be released. The way to know that this condition is satisfied is to check if each thread passed through a quiescent period (not in a critical section) after the object was removed. In the Linux kernel this means something like each task making a voluntary task switch. In a user space application it would be the end of a critical section. This can be achieved by a simple counter, each time the counter is even the thread is not in a critical section (reading shared data), each time the counter is odd the thread is inside a critical section, to move from a critical section or back all the thread needs to do is to atomically increment the number. Based on this the "garbage collector" can determine if each thread has had a chance to finish. There are several approaches, one simple one would be to queue up the requests to free memory (e.g. in a linked list or an array), each with the current generation (managed by the GC), when the GC runs it checks the state of the threads (their state counters) to see if each passed to the next generation (their counter is higher than the last time or is the same and even), any memory can be reclaimed one generation after it was freed. The advantage of this approach is that is places the least burden on the reading threads. The disadvantage is that it can't guarantee an upper bound for the memory waiting to be released (e.g. one thread spending 5 minutes in a critical section, while the data keeps changing and memory isn't released), but in practice it works out all right.
There is a number of reference counting solutions, many of them require double compare and swap, which some CPUs don't support, so can't be relied upon. The key problem remains though, taking a reference before updating the counter. I didn't find enough information to explain how this can be done simply and reliably though. So .....
There are of course a number of "Other" solutions, it's a very important topic of research with tons of papers out there. I didn't examine all of them. I only need one.
And of course the various approaches can be combined, for example hazard pointers can solve the problems of reference counting. But there's a nearly infinite number of combinations, and in some cases a spin lock might theoretically break wait-freedom, but doesn't hurt performance in practice. Somewhat like another tidbit I found in my research, it's theoretically not possible to implement wait-free algorithms using compare-and-swap, that's because in theory (purely in theory) a CAS based update might keep failing for non-deterministic excessive times (imagine a million threads on a million cores each trying to increment and decrement the same counter using CAS). In reality however it rarely fails more than a few times (I suspect it's because the CPUs spend more clocks away from CAS than there are CPUs, but I think if the algorithm returned to the same CAS on the same location every 50 clocks and there were 64 cores there could be a chance of a major problem, then again, who knows, I don't have a hundred core machine to try this). Another results of my research is that designing and implementing wait-free algorithms and data-structures is VERY challenging (even if some of the heavy lifting is outsourced, e.g. to a garbage collector [e.g. Java]), and might perform less well than a similar algorithm with carefully placed locks.
So, yeah, it's possible to free memory even without delays. It's just tricky. And if you forget to make the right operations atomic, or to place the right memory barrier, oh, well, you're toast. :-) Thanks everyone for participating.
I think atomic operations for increment/decrement and compare-and-swap would solve this problem.
Idea:
All resources have a counter which is modified with atomic operations. The counter is initially zero.
Before using a resource: "Acquire" it by atomically incrementing its counter. The resource can be used if and only if the incremented value is greater than zero.
After using a resource: "Release" it by atomically decrementing its counter. The resource should be disposed/freed if and only if the decremented value is equal to zero.
Before disposing: Atomically compare-and-swap the counter value with the minimum (negative) value. Dispose will not happen if a concurrent thread "Acquired" the resource in between.
You haven't specified a language for your question. Here goes an example in c#:
class MyResource
{
// Counter is initially zero. Resource will not be disposed until it has
// been acquired and released.
private int _counter;
public bool Acquire()
{
// Atomically increment counter.
int c = Interlocked.Increment(ref _counter);
// Resource is available if the resulting value is greater than zero.
return c > 0;
}
public bool Release()
{
// Atomically decrement counter.
int c = Interlocked.Decrement(ref _counter);
// We should never reach a negative value
Debug.Assert(c >= 0, "Resource was released without being acquired");
// Dispose when we reach zero
if (c == 0)
{
// Mark as disposed by setting counter its minimum value.
// Only do this if the counter remain at zero. Atomic compare-and-swap operation.
if (Interlocked.CompareExchange(ref _counter, int.MinValue, c) == c)
{
// TODO: Run dispose code (free stuff)
return true; // tell caller that resource is disposed
}
}
return false; // released but still in use
}
}
Usage:
// "r" is an instance of MyResource
bool acquired = false;
try
{
if (acquired = r.Acquire())
{
// TODO: Use resource
}
}
finally
{
if (acquired)
{
if (r.Release())
{
// Resource was disposed.
// TODO: Nullify variable or similar to let GC collect it.
}
}
}
I know this is not the best way but it works for me:
for shared dynamic data-structure lists I use usage counter per item
for example:
struct _data
{
DWORD usage;
bool delete;
// here add your data
_data() { usage=0; deleted=true; }
};
const int MAX = 1024;
_data data[MAX];
now when item is started to be used somwhere then
// start use of data[i]
data[i].cnt++;
after is no longer used then
// stop use of data[i]
data[i].cnt--;
if you want to add new item to list then
// add item
for (i=0;i<MAX;i++) // find first deleted item
if (data[i].deleted)
{
data[i].deleted=false;
data[i].cnt=0;
// copy/set your data
break;
}
and now in the background once in a while (on timer or whatever)
scann data[] an all undeleted items with cnt == 0 set as deleted (+ free its dynamic memory if it has any)
[Note]
to avoid multi-thread access problems implement single global lock per data list
and program it so you cannot scann data while any data[i].cnt is changing
one bool and one DWORD suffice for this if you do not want to use OS locks
// globals
bool data_cnt_locked=false;
DWORD data_cnt=0;
now any change of data[i].cnt modify like this:
// start use of data[i]
while (data_cnt_locked) Sleep(1);
data_cnt++;
data[i].cnt++;
data_cnt--;
and modify delete scan like this
while (data_cnt) Sleep(1);
data_cnt_locked=true;
Sleep(1);
if (data_cnt==0) // just to be sure
for (i=0;i<MAX;i++) // here scan for items to delete ...
if (!data[i].cnt)
if (!data[i].deleted)
{
data[i].deleted=true;
data[i].cnt=0;
// release your dynamic data ...
}
data_cnt_locked=false;
PS.
do not forget to play with the sleep times a little to suite your needs
lock free algorithm sleep times are sometimes dependent on OS task/scheduler
this is not really an lock free implementation
because while GC is at work then all is locked
but if ather than that multi access is not blocking to each other
so if you do not run GC too often you are fine

Delphi threading - which parts of code need to be protected/synchronized?

so far I thought that any operation done on "shared" object (common for multiple threads) must be protected with "synchronize", no matter what. Apparently, I was wrong - in the code I'm studying recently there are plenty of classes (thread-safe ones, as the Author claims) and only one of them uses Critical Section for almost every method.
How do I find what parts / methods of my code needs to be protected with CriticalSection (or any other method) and which not?
So far I haven't stumbled upon any interesting explanation / article / blog note, all google results are:
a) examples of synchronization between thread and the GUI. From simple progressbar to most complex, but still the lesson is obvious: each time you access / modify the property of GUI component, do that in "Synchronize". But nothing more.
b) articles explaining Critical Sections, Mutexes etc. Just a different approaches of protection/synchronization.
c) Examples of very very simple thread-safe classes (thread safe stack or list) - they all do the same - implement lock / unlock methods which do enter/leave critical section and return the actual stack/list pointer on locking.
Now I'm looking for explanation which parts of code should be protected.
could be in form of code ;) but please don't provide me with one more "using Synchronize to update progressbar" ... ;)
thank you!
You are asking for specific answers to a very general question.
Basically, apart of UI operations, you should protect every shared memory/resource access to avoid two potentially competing threads to:
read inconsistent memory
write memory at the same time
try to use the same resource at the same time from more than one thread... until the resource is thread-safe.
Generally, I consider any other operation thread safe, including operations that access not shared memory or not shared objects.
For example, consider this object:
type
TThrdExample = class
private
FValue: Integer;
public
procedure Inc;
procedure Dec;
function Value: Integer;
procedure ThreadInc;
procedure ThreadDec;
function ThreadValue: Integer;
end;
ThreadVar
ThreadValue: Integer;
Inc, Dec and Value are methods which operate over FValue field. The methods are not thread safe until you protect them with some synchronization mechanism. It can be a MultipleReaderExclusiveWriterSinchronizer for Value function and CriticalSection for Inc and Dec methods.
ThreadInc and ThreadDec methods operate over ThreadValue variable, which is defined as ThreadVar, so I consider it ThreadSafe because the memory they access is not shared between threads... each call from different thread will access different memory address.
If you know that, by design, a class should be used only in one thread or inside other synchronization mechanisms, you're free to consider that thread safe by design.
If you want more specific answers, I suggest you try with a more specific question.
Best regards.
EDIT: Maybe someone say the integer fields is a bad example because you can consider integer operations atomic on Intel/Windows thus is not needed to protect it... but I hope you get the idea.
You misunderstood TThread.Synchronize method.
TThread.Synchronize and TThread.Queue methods executes protected code in the context of main (GUI) thread. That is why you should use Syncronize or Queue to update GUI controls (like progressbar) - normally only main thread should access GUI controls.
Critical Sections are different - the protected code is executed in the context of the thread that acquired critical section, and no other thread is permitted to acquire the critical section until the former thread releases it.
You use critical section in case there's a need for a certain set of objects to be updated atomically. This means, they must at all times be either already updated completely or not yet updated at all. They must never be accessible in a transitional state.
For example, with a simple integer reading/writing this is not the case. The operation of reading integer as well as the operation of writing it are atomic already: you cannot read integer in the middle of processor writing it, half-updated. It's either old value or new value, always.
But if you want to increment the integer atomically, you have not one, but three operations you have to do at once: read the old value into processor's cache, increment it, and write it back to memory. Each operation is atomic, but the three of them together are not.
One thread might read the old value (say, 200), increment it by 5 in cache, and at the same time another thread might read the value too (still 200). Then the first thread writes back 205, while the second thread increments its cached value of 200 to 203 and writes back 203, overwriting 205. The result of two increments (+5 and +3) should be 208, but it's 203 due to non-atomicity of operations.
So, you use critical sections when:
A variable, set of variables, or any resource is used from several threads and needs to be updated atomically.
It's not atomic by itself (for example, calling a function which is guarded by critical section inside of the function body, is an atomic operation already)
Have a read of this documentation
http://www.eonclash.com/Tutorials/Multithreading/MartinHarvey1.1/ToC.html
If you use messaging to communicate between threads then you can basically ignore synchronisation primitives completely because each thread only accesses its internal structures and the messages themselves. In essence this is far easier and more scalable architecture than using synchronisation primitives.

Resources