I am trying to get the "atomic vs. non-atomic" concept settled in my mind. My first problem is that I could not find a real-life analogy for it, like a customer/restaurant relationship for atomic operations or something similar.
I would also like to learn how atomic operations fit into thread-safe programming.
In this blog post, http://preshing.com/20130618/atomic-vs-non-atomic-operations/, it is stated:
An operation acting on shared memory is atomic if it completes in a
single step relative to other threads. When an atomic store is
performed on a shared variable, no other thread can observe the
modification half-complete. When an atomic load is performed on a
shared variable, it reads the entire value as it appeared at a single
moment in time. Non-atomic loads and stores do not make those
guarantees.
What is the meaning of "no other thread can observe the modification half-complete"?
Does that mean a thread will wait until the atomic operation is done? How does that thread know the operation is atomic? In .NET, for example, I can understand that if you lock an object you set a flag to block other threads. But what about atomics? How do other threads know the difference between atomic and non-atomic operations?
Also, if the above statement is true, are all atomic operations thread-safe?
Let's first clarify what atomicity and blocking are. Atomicity means that an operation either executes fully, with all its side effects visible, or does not execute at all; other threads can only see the state before the operation or the state after it. A block of code guarded by a mutex is atomic too, we just don't call it an operation. Atomic operations are special CPU instructions which are conceptually similar to an ordinary operation guarded by a mutex (you know what a mutex is, so I'll use it here, despite the fact that mutexes themselves are implemented using atomic operations). The CPU has a limited set of operations which it can execute atomically, but thanks to hardware support they are very fast.
When we talk about threads blocking, we usually bring mutexes into the conversation, because code guarded by a mutex can take quite some time to execute; so we say that a thread waits on the mutex. For atomic operations the situation is the same, but they are so fast that we usually don't care about the delay, which is why you rarely hear the words "block" and "atomic operation" together.
Does that mean a thread will wait until the atomic operation is done?
Yes, it will wait. The CPU will restrict access to the block of memory where the variable is located, and other CPU cores will wait. Note that for performance reasons such blocks are held only for the duration of the atomic operations themselves; CPU cores are still allowed to cache variables for reading.
How does that thread know the operation is atomic?
Special CPU instructions are used. It is simply written in your program that a particular operation should be performed atomically.
Additional information:
There are trickier aspects to atomic operations. For example, on modern CPUs reads and writes of primitive types are usually atomic, but the CPU and the compiler are allowed to reorder them. So it is possible that you change some struct and then set a flag telling other threads it has changed, but the CPU reorders the writes and sets the flag before the struct is actually committed to memory. When you use atomic operations, additional effort is usually made to prevent undesired reordering. If you want to know more, you should read about memory barriers.
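As an illustration of that flag-publishing pattern, here is a minimal C++ sketch (my own example, assuming C++11 std::atomic; it is not part of the original answer) where a release store and an acquire load prevent the problematic reordering:

#include <atomic>

struct Payload { int a; int b; };

Payload data;                       // the struct being modified
std::atomic<bool> ready{false};     // flag telling readers the struct has changed

void writer() {
    data = {1, 2};                                  // modify the struct
    ready.store(true, std::memory_order_release);   // release: the writes to data cannot be reordered after this
}

void reader() {
    if (ready.load(std::memory_order_acquire)) {    // acquire: reads below cannot be reordered before this
        // it is now safe to read data.a and data.b
    }
}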
Simple atomic loads and stores are not that useful on their own. To make the most of atomic operations you need something more complex. The most common is CAS (compare-and-swap): you compare the variable with a value and change it only if the comparison succeeds.
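Here is a minimal sketch of what a single CAS call looks like in C++ (again my own illustration, assuming std::atomic; the ownership convention is invented): the comparison and the conditional write happen as one indivisible step.

#include <atomic>

std::atomic<int> owner{0};   // 0 means "unowned" in this made-up convention

// Try to claim the resource for thread `id`; exactly one caller can succeed.
bool try_claim(int id) {
    int expected = 0;
    // Atomically: if owner == expected, set owner = id and return true;
    // otherwise leave owner unchanged and return false.
    return owner.compare_exchange_strong(expected, id);
}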
On typical modern CPUs, atomic operations are made atomic this way:
When an instruction is issued that accesses memory, the core's logic attempts to put the core's cache in the correct state to access that memory. Typically, this state will be achieved before the memory access has to happen, so there is no delay.
While another core is performing an atomic operation on a chunk of memory, it locks that memory in its own cache. This prevents any other core from acquiring the right to access that memory until the atomic operation completes.
Unless two cores happen to be performing accesses to many of the same areas of memory and many of those accesses are writes, this typically won't involve any delays at all. That's because the atomic operation is very fast and typically the core knows in advance what memory it will need access to.
So, say a chunk of memory was last accessed on core 1 and now core 2 wants to do an atomic increment. When the core's prefetch logic sees the modification to that memory in the instruction stream, it will direct the cache to acquire that memory. The cache will use the intercore bus to take ownership of that region of memory from core 1's cache and it will lock that region in its own cache.
At this point, if another core tries to read or modify that region of memory, it will be unable to acquire that region in its cache until the lock is released. This communication takes place on the bus that connects the caches and precisely where it takes place depends on which cache(s) the memory was in. (If not in cache at all, then it has to go to main memory.)
A cache lock is not normally described as blocking a thread both because it is so fast and because the core is usually able to do other things while it's trying to acquire the memory region that is locked in the other cache. From the point of view of the higher-level code, the implementation of atomics is typically considered an implementation detail.
All atomic operations provide the guarantee that an intermediate result will not be seen. That's what makes them atomic.
The atomic operations you describe are instructions within the processor and the hardware will make sure that a read cannot happen on a memory location until the atomic write is complete. This guarantees that a thread either reads the value before write or the value after the write operation, but nothing in-between - there's no chance of reading half of the bytes of the value from before the write and the other half from after the write.
Code running on the processor is not even aware of this blocking, but it's really no different from using a lock statement to make sure that a more complex operation (made up of many low-level instructions) is atomic.
A single atomic operation is always thread-safe - the hardware guarantees that the effect of the operation is atomic - it'll never get interrupted in the middle.
A set of atomic operations, on the other hand, is not atomic in the vast majority of cases (I'm not an expert, so I don't want to make a definitive statement, but I can't think of a case where it would be). This is why locking is needed for complex operations: the whole operation may be made up of multiple atomic instructions, yet it can still be interrupted between any two of those instructions, creating the possibility of another thread seeing half-baked results. Locking ensures that other code operating on the shared data cannot access it until the operation completes (possibly spanning several thread switches).
Some examples are shown in this question/answer, but you can find many more by searching.
Being "atomic" is an attribute that applies to an operation which is enforced by the implementation (either the hardware or the compiler, generally speaking). For a real-life analogy, look to systems requiring transactions, such as bank accounts. A transfer from one account to another involves a withdrawal from one account and a deposit to another, but generally these should be performed atomically - there is no time when the money has been withdrawn but not yet deposited, or vice versa.
So, continuing the analogy for your question:
What is the meaning of "no other thread can observe the modification half-complete"?
This means that no thread could observe the two accounts in a state where the withdrawal had been made from one account but it had not been deposited in another.
In machine terms, it means that an atomic read of a value in one thread will not see a value with some bits from before an atomic write by another thread and some bits from after the same write. Operations more complex than a single read or write can also be atomic: "compare and swap", for instance, is a commonly implemented atomic operation that checks the value of a variable and replaces it with another value only if it equals an expected value, all atomically. If the comparison succeeds, it is not possible for another thread to write a different value in between the compare and the swap parts of the operation; any write by another thread will happen either wholly before or wholly after the atomic compare-and-swap.
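To make the bank-account analogy concrete, here is a small C++ sketch (the types and names are mine, invented for illustration) where a mutex makes the two-step transfer appear atomic to any other thread that also takes the lock:

#include <mutex>

struct Account { long balance = 0; };

std::mutex transfer_mutex;   // guards both accounts during a transfer

void transfer(Account& from, Account& to, long amount) {
    std::lock_guard<std::mutex> guard(transfer_mutex);
    from.balance -= amount;   // withdrawal
    to.balance   += amount;   // deposit
    // No other thread that locks transfer_mutex can observe the state
    // between the withdrawal and the deposit.
}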
The title to your question is:
Will atomic operations block other threads?
In the usual meaning of "block", the answer is no; an atomic operation in one thread won't by itself cause execution to stop in another thread, although it may cause a livelock situation or otherwise prevent progress.
Does that mean a thread will wait until the atomic operation is done?
Conceptually, it means that they will never need to wait. The operation is either done, or not done; it is never halfway done. In practice, atomic operations can be implemented using mutexes, at a significant performance cost. Many (if not most) modern processors support various atomic primitives at the hardware level.
Also, if the above statement is true, are all atomic operations thread-safe?
If you compose atomic operations, they are no longer atomic. That is, I can do one atomic compare-and-swap operation followed by another, and the two compare-and-swaps will individually be atomic, but they are divisible. Thus you can still have concurrency errors.
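A short C++ sketch of that pitfall (the ticket counter is a made-up example): each load and store below is individually atomic, yet the check-then-act sequence as a whole is divisible, so two threads can both pass the check.

#include <atomic>

std::atomic<int> tickets{1};   // one ticket left

void buy_ticket() {
    if (tickets.load() > 0) {                 // atomic read
        // another thread can run its own load and store right here
        tickets.store(tickets.load() - 1);    // atomic operations, but the whole sequence is not
    }
}

// The fix is a single indivisible read-modify-write, e.g. a CAS loop
// that only decrements while the observed value is still greater than zero.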
An atomic operation means the system performs the operation in its entirety or not at all. Reading or writing an Int64 is atomic (on a 64-bit OS with a 64-bit CLR) because the system reads/writes the 8 bytes in one single operation; readers do not see half of the new value and half of the old value. But be careful:
long n = 0; // writing 'n' is atomic, 64bits OS & 64bits CLR
long m = n; // reading 'n' is atomic
....// some code
long o = n++; // is not atomic : n = n + 1 is doing a read then a write in 2 separate operations
To make n++ atomic you can use the Interlocked API:
long o = Interlocked.Increment(ref n); // other threads are blocked while the atomic operation is running
Related
In a multi-threaded environment, isn't it the case that every operation on RAM must be synchronized?
Let's say I have a variable which holds a pointer to a memory address:
foo = 12345678
Now, if one thread sets that variable to another memory address (let's say 89ABCDEF) while another thread reads it, couldn't the reading thread see complete garbage if the access were not synchronized at some system level?
foo = 12345678   (before)
      89ABCDEF   (new data)
      •••••      (writing thread progress)
foo = 89ABC678   (memory content)
Since I have never seen this happen, I assume there is some system-level synchronization when writing variables, and I assume that is why such a write is called an 'atomic' operation. As I found here, this problem is actually a real topic and not something I made up.
On the other hand, I read everywhere that synchronization has a significant impact on performance (aside from threads that must wait because they cannot enter the lock; I mean just the act of locking and unlocking). For example:
synchronized adds a significant overhead to the methods […]. These operations are quite expensive […] it has an extreme impact on the program performance. […] the expensive synchronized operations that cause the code to be so terribly slow.
How do these fit together? Why is locking in order to change a variable unnoticeably fast, but locking for anything else so expensive? Or is it equally expensive, and should there be a big warning sign when using, say, long and double, because they always implicitly require synchronization?
Concerning your first point: when a processor writes some data to memory, the data is always written properly and cannot be "trashed" by writes from other threads, processes, the OS, etc. It is not a matter of synchronization; it is simply required for correct hardware behaviour.
Synchronization is a software concept that requires hardware support. Assume that you just want to acquire a lock. It is supposed to be free when it is 0 and locked when it is 1.
The basic method to do that is
got_the_lock = 0
while (!got_the_lock)
    fetch lock value from memory
    set lock value in memory to 1
    got_the_lock = (fetched value from memory == 0)
done
print "I got the lock!!"
The problem is that if other threads do the same thing at the same time and read the lock value before it has been set to 1, several threads may think they got the lock.
To avoid that, one needs atomic memory access. An atomic access is typically a read-modify-write cycle on a datum in memory that cannot be interrupted and that forbids access to that location until it completes. So not all accesses are atomic: only specific read-modify-write operations are, and they are realized thanks to specific processor support (see the test-and-set or fetch-and-add instructions, for instance). Most accesses do not need it and can be regular accesses. Atomic access is mostly used to synchronize threads, to ensure that only one thread is in a critical section.
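As a sketch of what the broken loop above becomes once an atomic read-modify-write is available (my own C++ illustration using std::atomic_flag, not part of the original answer):

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void acquire() {
    // test_and_set is an atomic read-modify-write: it sets the flag to true
    // and returns the previous value in one indivisible step.
    while (lock_flag.test_and_set(std::memory_order_acquire)) {
        // spin: another thread holds the lock
    }
}

void release() {
    lock_flag.clear(std::memory_order_release);
}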
So why are atomic accesses expensive? There are several reasons.
The first one is that one must ensure a proper ordering of instructions. You probably know that the execution order of instructions may differ from program order, provided the semantics of the program are respected. This is heavily exploited to improve performance: the compiler reorders instructions, the processor executes them out of order, write-back caches write data to memory in any order, and the memory write buffer does the same. This reordering can lead to improper behavior.
1 while (x--) ; // random and silly loop
2 f(y);
3 while(test_and_set(important_lock)) ; //spinlock to get a lock
4 g(z);
Obviously instruction 1 is not constraining, and instruction 2 can be executed before it (and instruction 1 will probably be removed by an optimizing compiler). But if instruction 4 is executed before instruction 3, the behavior will not be as expected.
To avoid that, an atomic access flushes the instruction and memory buffers, which requires tens of cycles (see memory barrier).
Without pipeline, you pay the full latency of the operation: read data from memory, modify it and write it back. This latency always happens, but for regular memory accesses you can do other work during that time that largely hides the latency.
An atomic access requires at least 100-200 cycles on modern processors and is accordingly extremely expensive.
How do these fit together? Why is locking in order to change a variable unnoticeably fast, but locking for anything else so expensive? Or is it equally expensive, and should there be a big warning sign when using, say, long and double, because they always implicitly require synchronization?
Regular memory accesses are not atomic. Only the specific synchronization instructions are expensive.
Synchronization always has a cost involved. And the cost increases with contention: threads wake up and fight for the lock, only one gets it, and the rest go to sleep, resulting in a lot of context switches.
However, such contention can be kept to a minimum by using synchronization at a much finer granularity, as in a CAS (compare-and-swap) operation by the CPU, or a memory barrier to read a volatile variable. An even better option is to avoid synchronization altogether without compromising safety.
Consider the following code:
synchronized(this) {
    // a DB call
}
This block of code can take several seconds to execute because it is doing I/O, and it therefore has a high chance of creating contention with other threads wanting to execute the same block. That is long enough to build up a massive queue of waiting threads on a busy system.
This is the reason non-blocking algorithms like the Treiber stack and the Michael-Scott queue exist. They do their work (which we would otherwise do with a much larger synchronized block) with the minimum amount of synchronization.
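For a flavour of what such a non-blocking structure looks like, here is a heavily simplified C++ sketch of a Treiber-style stack push (my own illustration; it deliberately ignores memory reclamation, which is the hard part of a real implementation):

#include <atomic>

template <typename T>
class TreiberStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head{nullptr};
public:
    void push(T value) {
        Node* node = new Node{value, head.load()};
        // Retry until head is swung from the node we last saw to our new node.
        // On failure, compare_exchange_weak reloads node->next with the current head.
        while (!head.compare_exchange_weak(node->next, node)) {
        }
    }
    // pop() omitted: safe memory reclamation (hazard pointers, epochs, ...) is needed there.
};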
isn't it the case that every operation on RAM must be synchronized?
No. Most of the "operations on RAM" will target memory locations that are only used by one thread. For example, in most programming languages, none of a thread's function arguments or local variables will be shared with other threads; and often, a thread will use heap objects that it does not share with any other thread.
You need synchronization when two or more threads communicate with one another through shared variables. There are two parts to it:
mutual exclusion
You may need to prevent "race conditions." If some thread T updates a data structure, it may have to put the structure into a temporary, invalid state before the update is complete. You can use mutual exclusion (i.e., mutexes/semaphores/locks/critical sections) to ensure that no other thread U can see the data structure when it is in that temporary, invalid state.
cache consistency
On a computer with more than one CPU, each processor typically has its own memory cache. So, when two different threads running on two different processors both access the same data, they may each be looking at their own, separately cached copy. Thus, when thread T updates that shared data structure, it is important to ensure that all of the variables it updated make it into thread U's cache before thread U is allowed to see any of them.
It would totally defeat the purpose of the separate caches if every write by one processor invalidated every other processor's cache, so there typically are special hardware instructions to do that only when it's needed, and typical mutex/lock implementations execute those instructions on entering or leaving a protected block of code.
According to the Wikipedia article, CAS does something like this:
function cas(p : pointer to int, old : int, new : int) returns bool {
    if *p ≠ old {
        return false
    }
    *p ← new
    return true
}
Well, it seems to me that if several processors try to execute a CAS instruction with the same arguments, there can be several write attempts at the same time, so it's not safe to do it anyway.
Where am I wrong?
Atomic read-compare-write instructions from multiple cores at the same time (on the same cache line) do contend with each other, but it's up to hardware to sort that out. Hardware arbitration of atomic RMW instructions is a real thing in modern CPUs, and provides some degree of fairness so that one thread spinning on lock cmpxchg can't totally block other threads doing the same thing.
(Although that's a bad design unless your retry could succeed without waiting for another thread to modify anything, e.g. a retry loop that implements fetch_or or similar can try again with the updated value of expected. But if waiting for a lock or flag to change, after the initial CAS fails, it's better to spin on an acquire or relaxed load and only do the CAS if it might succeed.)
There's no guarantee what order they happen in, which is why you need to carefully design your algorithm so that correctness only depends on that compare-and-exchange being atomic. (The ABA problem is a common pitfall).
BTW, that entire block of pseudocode happens as a single atomic operation. Making a read-compare-write or read-modify-write happen as a single atomic operation is much harder for the hardware than just stores, which MESIF/MOESI handle just fine.
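For reference, a minimal C++ sketch of the kind of retry loop mentioned above, implementing a fetch_or-style operation with compare_exchange_weak (purely illustrative; std::atomic already provides fetch_or for integral types):

#include <atomic>

// Atomically OR `bits` into `target` and return the previous value.
unsigned fetch_or_cas(std::atomic<unsigned>& target, unsigned bits) {
    unsigned expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `expected` with the current value,
    // so the next attempt can succeed without waiting for any other thread.
    while (!target.compare_exchange_weak(expected, expected | bits)) {
    }
    return expected;
}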
are you sure? I thought that it's unsafe to do that because, for example, x86 doesn't guarantee atomicity of writes for non-aligned DWORDs
lock cmpxchg makes the operation atomic regardless of alignment. It's potentially a lot slower for unaligned, especially on cache-line splits where atomically modifying a single cache line isn't enough.
See also Atomicity on x86 where I explain what it means for an operation to be atomic.
If you read the Wikipedia article, it says that CAS "is an atomic version of the following pseudocode", referring to the code you posted. Atomic means that the code will execute without interruption from other threads. So even if several threads try to execute this code at the same time with the same arguments (as you suggest), only one of them will return true, because in practice they will not execute simultaneously: atomicity requires that they run in isolation.
And since you mention "x86 doesn't guarantee atomicity of writes for non-aligned DWORDs", this is not an issue here either, because of the atomic property of the CAS operation.
I understand atomicity can be guaranteed on operations like xsub(), without using the LOCK prefix, by relying on the cache coherency protocol (MESI/MESIF).
1) How can the cache coherency protocol do this???
It makes me wonder: if the cache coherency protocol can enforce atomicity, why do we need special atomic types/instructions etc.?
2) If MOSI implements atomic instructions across multi-core systems then what is the purpose of LOCK? Legacy?
3) If MOSI implements atomic instructions and MOSI is used for all instructions, then why do atomic instructions cost so much? Surely the performance should be the same as normal instructions.
Atomicity and Memory Ordering
For an operation to be atomic it must appear to be one undivided operation to any observer. That observer can be anything that can see the effect of the operation, whether it's the thread that performed the operation, a different thread on the same processor, a thread on a different processor, or some component or device in the system. Observers that can't see the effect of the operation, whether the same thread, a different thread, or a device, don't affect whether the operation is atomic or not.
(Note that by processor I mean what Intel's documentation would call a logical processor. A system with two CPU sockets, each populated with a quad-core CPU with two logical processors per core would have a total of 16 processors.)
A related but different concept is memory ordering. Memory accesses are only sequentially consistent if they appear to an observer to happen in the order they occur in the program. This guarantee always applies when the observer is the same thread that performed the operations. Other, more limited guarantees of memory ordering are possible. A strong but not sequentially consistent ordering might guarantee that many sorts of operations are ordered with respect to each other, but not all. A weak memory ordering provides no guarantees about how accesses appear to other threads.
Compilers and Atomicity
When you're writing a program in C or some other higher-level language, it may appear that certain operations are atomic and sequentially ordered, but the compiler generally only guarantees this when viewed from the same thread that performed those operations. However, from the compiler's perspective, any code that runs when a thread is asynchronously interrupted happens in a different thread of execution, even if that code runs in the same OS thread. That means code running in a signal handler or a structured exception handler isn't guaranteed to see operations performed outside the handler, in the same thread, as being atomic or sequentially consistent.
Because of this limited general guarantee, the compiler is free to do things like implement what look to be atomic operations using multiple assembler instructions, making them appear non-atomic to other observers. The compiler can also reorder memory accesses, or even remove apparently redundant accesses entirely. It can do whatever optimizations it wants, so long as the program still behaves, in the single uninterrupted-thread case, as if it were doing all those operations in program order.
In the multi-threaded case, or where signal or exception handlers are present, it's necessary to take special steps to inform the compiler where you need it to provide broader guarantees of atomicity and memory ordering. That's the purpose of the special atomic types and functions. Even if the CPU guarantees that every instruction is atomic and every memory access is sequentially consistent to all other threads, the compiler doesn't.
Intel CPUs and Atomicity
Intel CPUs make it fairly easy for the compiler to provide these guarantees. Except for some odd cases, instructions are uninterruptible. Any event that causes the execution of an instruction to be interrupted either happens after the instruction is fully completed or allows the instruction to be resumed as if it had never executed. This means that at the machine-code level every operation is atomic, and every memory operation appears sequentially consistent to code running on the same processor. In the single-processor case, nothing needs to be done to provide these guarantees except when they need to be visible to devices other than the processor. In that case, the LOCK prefix combined with uncached memory regions must be used to guarantee that read/modify/write instructions are atomic and that memory accesses appear sequentially consistent to other devices.
In the multi-processor case, when accessing cached memory, the cache coherency protocol provides guarantees of atomicity for most instructions and a strong memory ordering, but not a sequentially consistent ordering. The exact mechanism by which it does this doesn't matter much, only the guarantees it gives. Any instruction that accesses only a single memory location will appear atomic to other processors. The ordering guarantees are too long to go into here (Intel uses 16 bullet points to describe them), but they are apparently a superset of the guarantees that C and C++ provide with the acquire and release memory orders. When that level of memory ordering is specified, the C/C++ atomic operations can use ordinary unlocked instructions.
The need for the LOCK prefix, and for the instructions where the LOCK prefix is implicit, arises when you need stronger guarantees than the cache coherency protocol provides. If you need your read/modify/write instructions to be atomic, you need to use the LOCK prefix. If you need sequentially consistent ordering, you need to use the LOCK prefix.
The LOCK prefix is where the high cost of atomic operations comes from. It causes the processor to wait for all previous load and store operations to complete. Even though, when accessing cached memory, the LOCK prefix is handled entirely within the cache without asserting LOCK#, the processor still needs to wait to ensure the operation appears sequentially consistent to other processors.
Summary
So in summary the answers to your questions are:
The cache coherency protocol can only enforce the atomicity of certain machine-code instructions as viewed from other processors. It can't ensure that the compiler generates a single instruction for an operation you want to be atomic. It also can't guarantee that the instruction appears to be atomic to non-processor devices on the system.
The LOCK prefix is used on machine code instructions that
perform multiple memory accesses and need to appear to be atomic to other processors
need to be sequentially consistent to other processors
need to be atomic and/or sequentially consistent to other non-processor devices.
When it's possible to get the necessary atomicity and memory-ordering guarantees without using the LOCK prefix, the instructions used are the same as ordinary instructions and so cost the same. Where the LOCK prefix is needed to provide the necessary guarantees, the cost of the instruction becomes much higher than that of a normal instruction.
There is no xsub instruction in x86, but there is an xadd ;)
You should read the section about the LOCK prefix in the Instruction Set Reference, and the section 8.1 LOCKED ATOMIC OPERATIONS in the Software Developer's Manual Volume 3A: System Programming Guide, Part 1.
"Single CPU" nowadays refers to a single core with its own cache. When you have multiple caches for multiple cores (physically in the same or in separate CPU chips), they use some cache coherency protocol. In the case of MESI, the core executing the atomic instruction will first ensure it has ownership of the cache line containing the operand, mark it Modified, and additionally lock it. If another core needs that cache line, it will issue a read which the owning core will snoop, delaying the answer until the atomic operation completes.
On single-CPU, single-core systems, most instructions are atomic with respect to threading, except for string instructions using a REP prefix, because scheduling interrupts, and thus context switches, only happen on instruction boundaries. A hardware device could, however, observe non-atomic behaviour.
I have been told/read online that the cache coherency protocol MESI/MESIF (http://en.wikipedia.org/wiki/MESI_protocol) also enforces atomicity, for example for a lock. However, this really doesn't make sense to me, for the following reasons:
1) MESI manages cache access for all instructions. If MESI also enforces atomicity, how do we get race conditions? Surely all instructions would be atomic and we'd never get race conditions?
2) If MESI guarantees atomicity, what's the point of the LOCK prefix?
3) Why do people say atomic instructions carry overhead, if they are implemented using the same cache coherency model as all other x86 instructions?
Generally speaking, could somebody please explain how the CPU implements locks at a low level?
The LOCK prefix has one purpose: it takes a lock on that address and instructs MESI to flush that cache line on all other processors, so that any read or write of that address by any other processor (or hardware device!) blocks until the lock is released (which happens at the end of the instruction).
The LOCK prefix is slow (several hundred cycles) because it has to synchronize the bus for the duration, and bus speed and latency are much lower than CPU speed.
General operation of a LOCK-prefixed instruction:
1. validate
2. establish address lock on cache line
3. wait for all processors to flush (MESI kicks in here)
4. perform operation within cache line
5. flush cache line to RAM (which releases the lock)
Disclaimer: Much of this comes from the documentation of the Pentium F00F bug (where the validate step was erroneously done after the lock was established) and so might be out of date.
As #voo said, you are confusing coherency with atomicity.
Cache coherency covers many scenarios, but the basic example is this: when two different agents (cores on a multicore chip, processors on a multi-socket system, etc.) access the same line, they may both have it cached locally. MESI guarantees that when one of them writes a new value, all other stale copies are first invalidated, to prevent use of the old value. As a by-product, this in fact guarantees atomicity of a single read or write access to memory, at cache-line granularity, which is part of the CPU charter on x86 (and on many other architectures as well). It does more than that: it's a crucial part of the memory ordering and consistency guarantees that the CPU provides you.
It does not, however, provide any larger scale of atomicity, which is crucial for handling concepts like thread safety and critical sections. What you are referring to with the locked operations is a read-modify-write flow, which is not guaranteed to be atomic by default (at least not on common CPUs), since it consists of two distinct accesses to memory. Without a lock in place, the CPU may receive a snoop in between and must respond according to the MESI protocol. The following scenario, for example, is perfectly legal:
core 0           | core 1
-----------------|-----------------
y = read [x]     |
increment y      | store [x] <- z
                 |
store [x] <- y   |
This means that your memory increment on core 0 didn't work as expected. If [x] holds a mutex, for example, you may think it was free and that you managed to grab it, while core 1 had already taken it.
Making the read-modify-write operation on core 0 locked (and x86 provides many options: locked add/inc, locked compare-exchange, etc.) would stall the other cores until the operation is done, so it essentially enhances the inter-core protocol to allow rejecting snoops.
It should be noted that a simple MESI protocol, if used correctly with alternative guarantees (like fences), can provide lock-free methods to perform atomic operations.
I think the point is that while the cache is involved in ordinary memory operations, it is required to do more for atomic operations than for run-of-the-mill ones.
Added later...
For ordinary operations:
when writing to memory, your typical core/cpu will maintain a write queue, so that once the write has been dispatched, the core/cpu continues processing instructions while some other mechanism deals with emptying the queue of pending writes, negotiating with the cache as required. On some processors the pending writes need not be written away in the order they were put into the queue.

when reading from memory, if the required value is not immediately available, the core/cpu may continue processing instructions while some other mechanism performs the required reads, negotiating with the cache as required.
All of which is designed to allow the core/cpu to keep going, decoupled as far as possible from the truly ghastly business of accessing real memory via layers of cache, which is all horribly slow.
Now, for your atomic operations, the state of the core/cpu has to be synchronised with the state of the cache/memory.
So, for a "release" store: (a) everything in the write queue must be completed, before (b) the "release" write itself is completed, before (c) normal processing can continue. So all the benefits of the asynchronous writing of cache/memory may have to be foregone, until the atomic write completes. Similarly, for an "acquire" load: any reads which come after the "acquire" read must be delayed.
As it happens, the x86 is remarkably "well behaved". It does not reorder writes, so a "release" store does not need any extra work to ensure that it comes after any earlier stores. On the read side it also does not need to do anything special for an "acquire". If two or more cores/cpus are reading and writing the same piece of memory, then there will be more invalidating and reloading of cache lines, with the attendant overhead. When doing a "sequentially consistent" store, it has to be followed by an explicit mfence operation, which will stall the cpu/core until all writes have been flushed from the write queue. It is true that "sequentially consistent" is easier to think about... but for code where access to shared data is protected by locks, "acquire"/"release" is sufficient.
For your atomic "read-modify-write" and conditional versions thereof, the interaction with the cache/memory is even stronger. The cpu/core executing the operation must not only synchronise itself with the state of cache/memory, it must also arrange for other cpus/cores which access the object of the atomic operation to stall until it is complete and has been written away (committed to cache/memory). The impact of this will depend on whether there is any actual contention with other cpu(s)/core(s) at that moment.
Take something simple like a counter that multiple threads will be incrementing. I read that mutex locks can decrease efficiency since the threads have to wait. So, to me, an atomic counter would be the most efficient, but I read that internally it is basically a lock? So I guess I'm confused how either could be more efficient than the other.
Atomic operations leverage processor support (compare and swap instructions) and don't use locks at all, whereas locks are more OS-dependent and perform differently on, for example, Win and Linux.
Locks actually suspend thread execution, freeing up CPU resources for other tasks, but incurring the obvious context-switching overhead when stopping/restarting the thread.
By contrast, threads attempting atomic operations don't wait; they keep trying until they succeed (so-called busy-waiting), so they don't incur context-switching overhead, but they don't free up CPU resources either.
Summing up: in general, atomic operations are faster when contention between threads is sufficiently low. You should definitely do benchmarking, as there is no other reliable method of knowing which has the lower overhead, context switching or busy-waiting.
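A rough C++ benchmarking sketch (entirely mine, with arbitrary thread and iteration counts) that compares a mutex-protected counter with a std::atomic counter might look like this:

#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

long counter = 0;
std::mutex counter_mutex;
std::atomic<long> atomic_counter{0};

// Run `body` iters times on each of `threads` threads and return the elapsed seconds.
template <typename Fn>
double time_threads(int threads, int iters, Fn body) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&] { for (int i = 0; i < iters; ++i) body(); });
    for (auto& w : workers) w.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    const int threads = 4, iters = 1000000;   // arbitrary values; tune for your machine
    double mutex_time = time_threads(threads, iters, [] {
        std::lock_guard<std::mutex> g(counter_mutex);
        ++counter;
    });
    double atomic_time = time_threads(threads, iters, [] {
        atomic_counter.fetch_add(1, std::memory_order_relaxed);
    });
    std::cout << "mutex:  " << mutex_time << " s\n"
              << "atomic: " << atomic_time << " s\n";
}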
If you have a counter for which atomic operations are supported, it will be more efficient than a mutex.
Technically, the atomic will lock the memory bus on most platforms. However, there are two ameliorating details:
It is impossible to suspend a thread during the memory bus lock, but it is possible to suspend a thread during a mutex lock. This is what lets you get a lock-free guarantee (which doesn't say anything about not locking - it just guarantees that at least one thread makes progress).
Mutexes eventually end up being implemented with atomics. Since you need at least one atomic operation to lock a mutex, and one atomic operation to unlock a mutex, it takes at least twice as long to do a mutex lock, even in the best of cases.
A minimal (standards compliant) mutex implementation requires 2 basic ingredients:
A way to atomically convey a state change between threads (the 'locked' state)
memory barriers to enforce that memory operations protected by the mutex stay inside the protected area.
There is no way you can make it any simpler than this because of the 'synchronizes-with' relationship the C++ standard requires.
A minimal (correct) implementation might look like this:
class mutex {
    std::atomic<bool> flag{false};
public:
    void lock()
    {
        while (flag.exchange(true, std::memory_order_relaxed));
        std::atomic_thread_fence(std::memory_order_acquire);
    }
    void unlock()
    {
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(false, std::memory_order_relaxed);
    }
};
Due to its simplicity (it cannot suspend the thread of execution), it is likely that, under low contention, this implementation outperforms a std::mutex.
But even then, it is easy to see that each integer increment, protected by this mutex, requires the following operations:
an atomic store to release the mutex
an atomic compare-and-swap (read-modify-write) to acquire the mutex (possibly multiple times)
an integer increment
If you compare that with a standalone std::atomic<int> that is incremented with a single (unconditional) read-modify-write (e.g. fetch_add), it is reasonable to expect that an atomic operation (using the same ordering model) will outperform the case where a mutex is used.
An atomic integer is a user-mode object and is therefore much more efficient than a mutex, which runs in kernel mode. The scope of an atomic integer is a single application, while the scope of a mutex is all running software on the machine.
The atomic variable classes in Java are able to take advantage of compare-and-swap instructions provided by the processor.
Here's a detailed description of the differences: http://www.ibm.com/developerworks/library/j-jtp11234/
A mutex is a kernel-level construct which provides mutual exclusion even at the process level. Note that it can be helpful for extending mutual exclusion across process boundaries, not just within a process (for threads). It is costlier.
An atomic counter, AtomicInteger for example, is based on CAS, and usually keeps attempting the operation until it succeeds. Basically, in this case, threads race or compete to increment/decrement the value atomically. Here, you may see a lot of CPU cycles being used by a thread trying to operate on the current value.
Since you want to maintain a counter, AtomicInteger/AtomicLong will be the best fit for your use case.
Most processors support an atomic read and write, and often an atomic compare-and-swap. This means that the processor itself writes or reads the latest value in a single operation, and there might be a few cycles lost compared to normal integer access, especially as the compiler can't optimise around atomic operations nearly as well as around normal ones.
On the other hand, a mutex takes a number of lines of code to enter and leave, and during that execution other processors that access the same location are completely stalled, so clearly a big overhead on them. In unoptimised high-level code, the mutex enter/exit and the atomic operation will be function calls, but for the mutex, any competing processor will be locked out from the time your mutex-enter function returns until your exit function has started. For the atomic, it is only the duration of the actual operation that is locked out. Optimisation should reduce that cost, but not all of it.
If you are trying to increment, then your modern processor probably supports atomic increment/decrement, which will be great.
If it does not, then it is either implemented using the processor atomic cmp&swap, or using a mutex.
Mutex:
get the lock
read
increment
write
release the lock
Atomic cmp&swap:
atomic read the value
calc the increment
do {
    atomic cmp&swap value, increment
    recalc the increment
} while the cmp&swap did not see the expected value
So this second version has a loop [in case another processor increments the value between our atomic operations, so that the value no longer matches and the increment would be wrong] which can get long [if there are many competitors], but it should generally still be quicker than the mutex version; the mutex version, however, may allow that processor to task-switch.
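For concreteness, the cmp&swap pseudocode above corresponds roughly to this C++ sketch (my own illustration; on platforms with an atomic increment, std::atomic's fetch_add does all of this for you):

#include <atomic>

void increment(std::atomic<long>& value) {
    long observed = value.load();   // atomic read of the value
    // Write observed + 1 only if the value still equals observed; on failure,
    // compare_exchange_weak reloads `observed`, so we recalculate and retry.
    while (!value.compare_exchange_weak(observed, observed + 1)) {
    }
}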