How do I Understand Read Memory Barriers and Volatile - multithreading

Some languages provide a volatile modifier that is described as performing a "read memory barrier" prior to reading the memory that backs a variable.
A read memory barrier is commonly described as a way to ensure that the CPU has performed the reads requested before the barrier before it performs a read requested after the barrier. However, using this definition, it would seem that a stale value could still be read. In other words, ordering the reads does not seem to require that main memory or other CPUs be consulted, so the values read after the barrier might not actually reflect the latest values in the system at the time of the read barrier, or values written after it.
So, does volatile really guarantee that an up-to-date value is read or just (gasp!) that the values that are read are at least as up-to-date as the reads before the barrier? Or some other interpretation? What are the practical implications of this answer?

There are read barriers and write barriers; acquire barriers and release barriers. And more (io vs memory, etc).
The barriers are not there to control "latest" value or "freshness" of the values. They are there to control the relative ordering of memory accesses.
Write barriers control the order of writes. Because writes to memory are slow (compared to the speed of the CPU), there is usually a write-request queue where writes are posted before they 'really happen'. Although they are queued in order, while inside the queue the writes may be reordered. (So maybe 'queue' isn't the best name...) Unless you use write barriers to prevent the reordering.
Read barriers control the order of reads. Reads may happen out of order because of speculative execution (the CPU looks ahead and loads from memory early) and because of the write buffer (the CPU will read a value from the write buffer instead of memory if it is there - ie if the CPU thinks it just wrote x = 5, why read it back? Just see that it is still waiting to become 5 in the write buffer).
This is true regardless of what the compiler tries to do with respect to the order of the generated code. ie 'volatile' in C++ won't help here, because it only tells the compiler to output code to re-read the value from "memory", it does NOT tell the CPU how/where to read it from (ie "memory" is many things at the CPU level).
So read/write barriers put up blocks to prevent reordering in the read/write queues (the read isn't usually so much of a queue, but the reordering effects are the same).
What kinds of blocks? - acquire and/or release blocks.
Acquire - eg read-acquire(x) will add the read of x into the read-queue and flush the queue (not really flush the queue, but add a marker saying don't reorder anything before this read, which is as if the queue was flushed). So later (in code order) reads can be reordered, but not before the read of x.
Release - eg write-release(x, 5) will flush (or marker) the queue first, then add the write-request to the write-queue. So earlier writes won't become reordered to happen after x = 5, but note that later writes can be reordered before x = 5.
Note that I paired the read with acquire and write with release because this is typical, but different combinations are possible.
Acquire and Release are considered 'half-barriers' or 'half-fences' because they only stop the reordering from going one way.
A full barrier (or full fence) applies both an acquire and a release - ie no reordering.
Typically for lock-free programming, or for C# or Java 'volatile', what you want/need is read-acquire and write-release. ie:
void threadA()
{
    foo->x = 10;
    foo->y = 11;
    foo->z = 12;
    write_release(foo->ready, true);
    bar = 13;
}

void threadB()
{
    w = some_global;
    ready = read_acquire(foo->ready);
    if (ready)
    {
        q = w * foo->x * foo->y * foo->z;
    }
    else
        calculate_pi();
}
So, first of all, this is a bad way to program threads. Locks would be safer. But just to illustrate barriers...
After threadA() is done writing foo, it needs to write foo->ready LAST, really last, else other threads might see foo->ready early and get the wrong values of x/y/z. So we use a write_release on foo->ready, which, as mentioned above, effectively 'flushes' the write queue (ensuring x,y,z are committed) then adds the ready=true request to the queue. And then adds the bar=13 request. Note that since we just used a release barrier (not a full) bar=13 may get written before ready. But we don't care! ie we are assuming bar is not changing shared data.
Now threadB() needs to know that when we say 'ready' we really mean ready. So we do a read_acquire(foo->ready). This read is added to the read queue, THEN the queue is flushed. Note that w = some_global may also still be in the queue. So foo->ready may be read before some_global. But again, we don't care, as it is not part of the important data that we are being so careful about.
What we do care about is foo->x/y/z. So they are added to the read queue after the acquire flush/marker, guaranteeing that they are read only after reading foo->ready.
Note also that these are typically the exact same barriers used for locking and unlocking a mutex/CriticalSection/etc. (ie acquire on lock(), release on unlock()).
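For reference, here is a rough sketch of how the same example might look with C++11 std::atomic, where an atomic<bool> plus explicit memory orders stands in for the hypothetical write_release / read_acquire helpers (the Foo layout and the globals are just stand-ins from the example above):

#include <atomic>

struct Foo {
    int x = 0, y = 0, z = 0;
    std::atomic<bool> ready{false};
};

Foo* foo = new Foo;
int bar = 0, q = 0, some_global = 42;

void threadA()
{
    foo->x = 10;
    foo->y = 11;
    foo->z = 12;
    // write-release: earlier writes cannot be reordered after this store
    foo->ready.store(true, std::memory_order_release);
    bar = 13;   // may still be reordered before the store - we don't care
}

void threadB()
{
    int w = some_global;
    // read-acquire: later reads cannot be reordered before this load
    if (foo->ready.load(std::memory_order_acquire))
        q = w * foo->x * foo->y * foo->z;
}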
So,
I'm pretty sure this (ie acquire/release) is exactly what MS docs say happens for read/writes of 'volatile' variables in C# (and optionally for MS C++, but this is non-standard). See http://msdn.microsoft.com/en-us/library/aa645755(VS.71).aspx including "A volatile read has "acquire semantics"; that is, it is guaranteed to occur prior to any references to memory that occur after it..."
I think java is the same, although I'm not as familiar. I suspect it is exactly the same, because you just don't typically need more guarantees than read-acquire/write-release.
In your question you were on the right track when thinking that it is really all about relative order - you just had the orderings backwards (ie "the values that are read are at least as up-to-date as the reads before the barrier?" - no, reads before the barrier are unimportant; it's the reads AFTER the barrier that are guaranteed to happen AFTER, and vice versa for writes).
And please note, as mentioned, reordering happens on both reads and writes, so only using a barrier on one thread and not the other WILL NOT WORK. ie a write-release isn't enough without the read-acquire. ie even if you write it in the right order, it could be read in the wrong order if you didn't use the read barriers to go with the write barriers.
And lastly, note that lock-free programming and CPU memory architectures can be actually much more complicated than that, but sticking with acquire/release will get you pretty far.

volatile in most programming languages does not imply a real CPU read memory barrier but an instruction to the compiler not to optimize the reads away by caching the value in a register. This means that the reading process/thread will get the value "eventually". A common technique is to declare a volatile boolean flag that is set in a signal handler and checked in the main program loop.
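A minimal C++ sketch of that signal-flag technique, assuming a SIGINT handler (the handler and loop body are made up for illustration; note this pattern is for signal handlers, not for inter-thread synchronization):

#include <csignal>

// volatile makes the compiler re-read the flag on every loop iteration;
// sig_atomic_t guarantees the read/write itself cannot be torn by the signal.
volatile std::sig_atomic_t stop_requested = 0;

extern "C" void on_sigint(int) { stop_requested = 1; }

int main()
{
    std::signal(SIGINT, on_sigint);
    while (!stop_requested) {
        // ... do work ...
    }
}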
In contrast, CPU memory barriers are provided directly via CPU instructions or implied by certain assembler mnemonics (such as the lock prefix on x86), and are used, for example, when talking to hardware devices where the order of reads and writes to memory-mapped IO registers is important, or when synchronizing memory access in a multi-processing environment.
To answer your question - no, a memory barrier does not guarantee a "latest" value, but it does guarantee the order of memory access operations. This is crucial, for example, in lock-free programming.
Here is one of the primers on CPU memory barriers.


How to reproduce and test the function of memory barrier in X86 Linux [duplicate]

I've been studying the memory model and saw this (quote from https://research.swtch.com/hwmm):
Litmus Test: Write Queue (also called Store Buffer)
Can this program see r1 = 0, r2 = 0?
// Thread 1        // Thread 2
x = 1              y = 1
r1 = y             r2 = x
On sequentially consistent hardware: no.
On x86 (or other TSO): yes!
Fact 1: This is the store buffer litmus test mentioned in many articles. They all say that both r1 and r2 being zero could happen on TSO because of the existence of the store buffer. They seem to assume that all the stores and loads are executed in order, and yet the result is both r1 and r2 being zero. This later concludes that "store/load reordering could happen", as a "consequence of the store buffer's existence".
Fact 2: However we know that OoO execution could also reorder the store and the load in both threads. In this sense, regardless of the store buffer, this reordering could result in both r1 and r2 being zero, as long as all four instructions retire without seeing each other's invalidation to x or y. And this seems to me to say that "store/load reordering could happen" simply because "they are executed out of order". (I might be very wrong about this since this is the best I know of speculation and OoO execution.)
I wonder how these two facts converge (assuming I happen to be right about both): Is store buffer or OoO execution the reason for "store/load reordering", or both are?
Alternatively speaking: Say I somehow observed this litmus test on an x86 machine, was it because of the store buffer, or OoO execution? Or is it even possible to know which?
EDIT: Actually my major confusion is the unclear causality among the following points from various pieces of literature:
OoO execution can cause the memory reordering;
Store/load reordering is caused by the store buffer and demonstrated by a litmus test (and thus named as "store buffer");
Some program having the exact same instructions as the store buffer litmus test is used as an observable OoO execution example, just as this article https://preshing.com/20120515/memory-reordering-caught-in-the-act does.
1 + 2 seems to imply that the store buffer is the cause, and OoO execution is the consequence. 3 + 1 seems to imply that OoO execution is the cause, and memory reordering is the consequence. I can no longer tell which causes which. And it is that litmus test sitting in the middle of this mystery.
It makes some sense to call StoreLoad reordering an effect of the store buffer because the way to prevent it is with mfence or a locked instruction that drains the store buffer before later loads are allowed to read from cache. Merely serializing execution (with lfence) would not be sufficient, because the store buffer still exists. Note that even sfence ; lfence isn't sufficient.
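To make that concrete, here is a rough C++ sketch of the litmus test with the StoreLoad reordering blocked by full fences (the thread functions and variable names are mine; on x86 a seq_cst fence typically compiles to mfence or a locked instruction):

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1()
{
    x.store(1, std::memory_order_relaxed);
    // Full barrier: drains the store buffer before the following load can read from cache.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_relaxed);
}

void thread2()
{
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}

// With both fences in place, r1 == 0 && r2 == 0 is no longer a possible outcome.
// Remove either fence and the store buffer lets that outcome reappear.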
Also I assume P5 Pentium (in-order dual-issue) has a store buffer, so SMP systems based on it could have this effect, in which case it would definitely be due to the store buffer. IDK how thoroughly the x86 memory model was documented in the early days before PPro even existed, but any naming of litmus tests done before that might well reflect in-order assumptions. (And naming after might include still-existing in-order systems.)
You can't tell which effect caused StoreLoad reordering. It's possible on a real x86 CPU (with a store buffer) for a later load to execute before the store has even written its address and data to the store buffer.
And yes, executing a store just means writing to the store buffer; it can't commit from the SB to L1d cache and become visible to other cores until after the store retires from the ROB (and thus is known to be non-speculative).
(Retirement happens in-order to support "precise exceptions". Otherwise, chaos ensues and discovering a mis-predict might mean rolling back the state of other cores, i.e. a design that's not sane. Can a speculatively executed CPU branch contain opcodes that access RAM? explains why a store buffer is necessary for OoO exec in general.)
I can't think of any detectable side-effect of the load uop executing before the store-data and/or store-address uops, or before the store retires, rather than after the store retires but before it commits to L1d cache.
You could force the latter case by putting an lfence between the store and the load, so the reordering is definitely caused by the store buffer. (A stronger barrier like mfence, a locked instruction, or a serializing instruction like cpuid, will all block the reordering entirely by draining the store buffer before the later load can execute. As an implementation detail, before it can even issue.)
A normal out of order exec treats all instructions as speculative, only becoming non-speculative when they retire from the ROB, which is done in program order to support precise exceptions. (See Out-of-order execution vs. speculative execution for a more in-depth exploration of that idea, in the context of Intel's Meltdown vulnerability.)
A hypothetical design with OoO exec but no store buffer would be possible. It would perform terribly, with each store having to wait for all previous instructions to be definitively known to not fault or otherwise be mispredicted / mis-speculated before the store can be allowed to execute.
This is not quite the same thing as saying that they need to have already executed, though (e.g. just executing the store-address uop of an earlier store would be enough to know it's non-faulting, or for a load, doing the TLB/page-table checks will tell you it's non-faulting even if the data hasn't arrived yet). However, every branch instruction would need to be already executed (and known-correct), as would every ALU instruction like div that can fault.
Such a CPU also doesn't need to stop later loads from running before stores. A speculative load has no architectural effect / visibility, so it's ok if other cores see a share-request for a cache line which was the result of a mis-speculation. (On a memory region whose semantics allow that, such as normal WB write-back cacheable memory). That's why HW prefetching and speculative execution work in normal CPUs.
The memory model even allows StoreLoad ordering, so we're not speculating on memory ordering, only on the store (and other intervening instructions) not faulting. Which again is fine; speculative loads are always fine, it's speculative stores that we must not let other cores see. (So we can't do them at all if we don't have a store buffer or some other mechanism.)
(Fun fact: real x86 CPUs do speculate on memory ordering by doing loads out of order with each other, depending on addresses being ready or not, and on cache hit/miss. This can lead to memory order mis-speculation "machine clears" aka pipeline nukes (machine_clears.memory_ordering perf event) if another core wrote to a cache line between when it was actually read and the earliest the memory model said we could. Or even if we guess wrong about whether a load is going to reload something stored recently or not; memory disambiguation when addresses aren't ready yet involves dynamic prediction so you can provoke machine_clears.memory_ordering with single-threaded code.)
Out-of-order exec in P6 didn't introduce any new kinds of memory re-ordering because that could have broken existing multi-threaded binaries. (At that time mostly just OS kernels, I'd guess!) That's why early loads have to be speculative if done at all. x86's main reason for existence is backwards compat; back then it wasn't the performance king.
Re: why this litmus test exists at all, if that's what you mean?
Obviously to highlight something that can happen on x86.
Is StoreLoad reordering important? Usually it's not a problem; acquire / release synchronization is sufficient for most inter-thread communication about a buffer being ready to read, or more generally a lock-free queue. Or to implement mutexes. ISO C++ only guarantees that mutexes lock / unlock are acquire and release operations, not seq_cst.
It's pretty rare that an algorithm depends on draining the store buffer before a later load.
Say I somehow observed this litmus test on an x86 machine,
Fully working program that verifies that this reordering is possible in real life on real x86 CPUs: https://preshing.com/20120515/memory-reordering-caught-in-the-act/. (The rest of Preshing's articles on memory ordering are also excellent. Great for getting a conceptual understanding of inter-thread communication via lockless operations.)

Do I need always lock before read even if very occasionally write to the memory?

I have a shared data structure that is read in one thread and modified in another thread. However, its data changes very occasionally. Most of time, it is read by the thread. I now have a Mutex (or RW lock) locked before read/write and unlocked after read/write.
Because the data rarely changes, locking and unlocking every time it is read seems inefficient. If no change is made to the data, I could get rid of the lock, because reads of the same structure can safely run simultaneously without a lock.
My question is:
Is there a lock-free solution that allows me to change the data without a lock?
Or, does the lock-unlock on the read side (one thread, in other words no contention) not take much time/resources (no entry into the kernel) at all?
If there's no contention, no kernel call is needed, but an atomic lock acquisition is still needed. If the resource is only occupied for a short period of time, spinning can be attempted before falling back to a kernel call.
Mutex and RW lock implementations, such as (any usual quality implementation of) std::mutex / std::shared_mutex in C++ or CRITICAL_SECTION / SRWLOCK on Windows, already employ the above-mentioned techniques on their own. Linux mutexes are usually based on futex, so they also avoid the kernel call when it is not needed. So you don't need to bother about saving a kernel call yourself.
And there are alternatives to locking. There are atomic types that can be accessed using lock-free reads and writes, so they can avoid locking entirely. There are other patterns, such as a SeqLock. There is transactional memory.
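For instance, a minimal sketch of the read-mostly case, assuming C++20 (which adds std::atomic<std::shared_ptr<T>>); the Config type and its fields are made up:

#include <atomic>
#include <memory>
#include <string>

struct Config {
    int max_connections;
    std::string endpoint;
};

// The reader thread and the writer thread share only this atomic pointer.
std::atomic<std::shared_ptr<const Config>> current_config{
    std::make_shared<const Config>(Config{100, "localhost"})};

// Hot path: readers take a snapshot without holding any mutex.
int read_max_connections()
{
    std::shared_ptr<const Config> snapshot = current_config.load();
    return snapshot->max_connections;
}

// Rare path: the writer builds a whole new object and publishes it atomically.
void update_config()
{
    current_config.store(std::make_shared<const Config>(Config{200, "example.org"}));
}

Note that whether this is truly lock-free depends on the implementation (check current_config.is_lock_free()); some standard libraries protect an atomic shared_ptr with an internal spinlock.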
But before going there, you should make sure that locking actually is a performance problem, because the use of atomics may not be simple (although it is simple in some languages and in simple cases), and the other alternatives have their own pitfalls.
An uncontrolled data race may be dangerous. Or maybe not. And there may be a very thin boundary between the cases where it is and where it is not. For example, copying a bunch of integers might only occasionally produce stale values: if the integers are properly sized and aligned, you may get a mix of old and new elements but never a garbage value for a single integer; yet if you add some more complex type, say a string, you may get a crash. So most of the time an uncontrolled data race is treated as Undefined Behavior.

Is synchronization for variable change cheaper than for something else?

In a multi-threading environment, isn’t it that every operation on the RAM must be synchronized?
Let’s say, I have a variable, which is a pointer to another memory address:
foo: 12345678
Now, if one thread sets that variable to another memory address (let’s say 89ABCDEF) while the first thread reads the variable, couldn’t it be that the first thread reads complete garbage from the variable if access weren’t synchronized (on some system level)?
foo: 12345678   (before)
     89ABCDEF   (new data)
     •••••      (writing thread progress: only part of the new value has been written)
     89ABC678   (memory content seen by the reader: a torn mix of new and old)
Since I have never seen those things happen, I assume that there is some system-level synchronization when writing variables. I assume that this is why it is called an ‘atomic’ operation. As I found here, this problem is actually a real topic and not something I just made up.
On the other hand, I read everywhere that synchronizing has a significant impact on performance. (Aside from threads that must wait bc. they cannot enter the lock; I mean just the action of locking and unlocking.) Like here:
synchronized adds a significant overhead to the methods […]. These operations are quite expensive […] it has an extreme impact on the program performance. […] the expensive synchronized operations that cause the code to be so terribly slow.
How does this go together? Why is locking for changing a variable unnoticeably fast, but locking for anything else so expensive? Or, is it equally expensive, and should there be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?
Concerning your first point, when a processor writes some data to memory, this data is always properly written and cannot be "trashed" by other writes from threads, processes, the OS, etc. (at least for aligned accesses no wider than the machine word). It is not a matter of synchronization; it is simply required to ensure proper hardware behaviour.
Synchronization is a software concept that requires hardware support. Assume that you just want to acquire a lock. It is supposed to be free when at 0 and locked when at 1.
The basic method to do that is
got_the_lock = 0
while (!got_the_lock)
    fetch lock value from memory
    set lock value in memory to 1
    got_the_lock = (fetched value from memory == 0)
done
print "I got the lock!!"
The problem is that if other threads do the same thing at the same time and read the lock value before it has been set to 1, several threads may think they got the lock.
To avoid that, one needs atomic memory access. An atomic access is typically a read-modify-write cycle on a datum in memory that cannot be interrupted and that forbids access to that location until completion. So not all accesses are atomic, only specific read-modify-write operations, and they are realized thanks to specific processor support (see the test-and-set or fetch-and-add instructions, for instance). Most accesses do not need it and can be regular accesses. Atomic access is mostly used to synchronize threads, to ensure that only one thread is in a critical section.
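As a rough C++ illustration of those primitives (the function names are mine): std::atomic_flag::test_and_set and std::atomic<int>::fetch_add map onto the test-and-set and fetch-and-add instructions mentioned above.

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;  // test-and-set based spin lock
std::atomic<int> counter{0};                    // fetch-and-add based counter

void enter_critical_section()
{
    // Atomically set the flag to 1 and get back its previous value;
    // loop until the previous value was 0, i.e. until we are the one who locked it.
    while (lock_flag.test_and_set(std::memory_order_acquire)) {
        // spin
    }
}

void leave_critical_section()
{
    lock_flag.clear(std::memory_order_release);
}

void count_event()
{
    // The read-add-write happens as one uninterruptible operation.
    counter.fetch_add(1, std::memory_order_relaxed);
}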
So why are atomic accesses expensive? There are several reasons.
The first one is that one must ensure a proper ordering of instructions. You probably know that the order in which instructions execute may differ from program order, provided the semantics of the program are respected. This is heavily exploited to improve performance: compilers reorder instructions, processors execute them out of order, write-back caches write data to memory in any order, and memory write buffers do the same thing. This reordering can lead to improper behavior.
1 while (x--) ; // random and silly loop
2 f(y);
3 while(test_and_set(important_lock)) ; //spinlock to get a lock
4 g(z);
Obviously instruction 1 is not constraining and 2 can be executed before (and probably 1 will be removed by an optimizing compiler). But if 4 is executed before 3, the behavior will not be as expected.
To avoid that, an atomic access flushes the instruction pipeline and memory buffers, which requires tens of cycles (see memory barrier).
Without pipelining, you pay the full latency of the operation: read the data from memory, modify it, and write it back. This latency always exists, but for regular memory accesses you can do other work during that time, which largely hides the latency.
An atomic access requires at least 100-200 cycles on modern processors and is accordingly extremely expensive.
How does this go together? Why is locking for changing a variable unnoticeably fast, but locking for anything else so expensive? Or, is it equally expensive, and should there be a big warning sign when using—let’s say—long and double because they always implicitly require synchronization?
Regular memory accesses are not atomic. Only the specific synchronization instructions are expensive.
Synchronization always has a cost involved. And the cost increases with contention: threads wake up and fight for the lock, only one gets it, and the rest go back to sleep, resulting in a lot of context switches.
However, such contention can be kept to a minimum by using synchronization at a much more granular level, as with a CAS (compare-and-swap) operation by the CPU, or a memory barrier when reading a volatile variable. A far better option is to avoid synchronization altogether without compromising safety.
Consider the following code:
synchronized(this) {
    // a DB call
}
This block of code will take several seconds to execute as it is doing IO, and therefore runs a high chance of creating contention among the other threads wanting to execute the same block. The duration is enough to build up a massive queue of waiting threads in a busy system.
This is the reason non-blocking algorithms like the Treiber stack and the Michael-Scott queue exist. They do their tasks (which we'd otherwise do using a much larger synchronized block) with the minimum amount of synchronization, as the sketch below illustrates.
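As a rough C++ illustration of that minimal synchronization, here is just the push side of a Treiber stack (pop is omitted because safe memory reclamation and the ABA problem need extra machinery); the Node layout is made up:

#include <atomic>

template <typename T>
class TreiberStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head{nullptr};

public:
    void push(T value)
    {
        Node* node = new Node{value, head.load(std::memory_order_relaxed)};
        // The only synchronization is this single CAS: publish the new node
        // if the head hasn't changed, otherwise reload and retry.
        while (!head.compare_exchange_weak(node->next, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
            // node->next was refreshed with the current head by the failed CAS; just retry.
        }
    }
};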
isn’t it that every operation on the RAM must be synchronized?
No. Most of the "operations on RAM" will target memory locations that are only used by one thread. For example, in most programming languages, none of a thread's function arguments or local variables will be shared with other threads; and often, a thread will use heap objects that it does not share with any other thread.
You need synchronization when two or more threads communicate with one another through shared variables. There are two parts to it:
mutual exclusion
You may need to prevent "race conditions." If some thread T updates a data structure, it may have to put the structure into a temporary, invalid state before the update is complete. You can use mutual exclusion (i.e., mutexes/semaphores/locks/critical sections) to ensure that no other thread U can see the data structure when it is in that temporary, invalid state.
cache consistency
On a computer with more than one CPU, each processor typically has its own memory cache. So, when two different threads running on two different processors both access the same data, they may each be looking at their own, separately cached copy. Thus, when thread T updates that shared data structure, it is important to ensure that all of the variables it updated make it into thread U's cache before thread U is allowed to see any of them.
It would totally defeat the purpose of the separate caches if every write by one processor invalidated every other processor's cache, so there typically are special hardware instructions to do that only when it's needed, and typical mutex/lock implementations execute those instructions on entering or leaving a protected block of code.

Deciding the critical section of kernel code

Hi, I am writing kernel code which is intended to do process scheduling and multi-threaded execution. I've studied locking mechanisms and their functionality. Is there a rule of thumb regarding what sort of data structure in a critical section should be protected by locking (mutexes/semaphores/spinlocks)?
I know that wherever there is a chance of concurrency in a part of the code, we require a lock. But how do we decide? What if we miss one and the test cases don't catch it? Earlier I wrote code for system calls and file systems where I never cared about taking locks.
Is there a rule of thumb regarding what sort of data structure in a critical section should be protected by locking?
Any object (global variable, field of a structure, etc.) that is accessed concurrently, where at least one access is a write, requires some locking discipline for access.
But how do we decide? What if we miss one and the test cases don't catch it?
Good practice is an appropriate comment for every declaration of a variable, structure, or structure field which requires a locking discipline for access. Anyone who uses this variable reads this comment and writes the corresponding access code. The kernel core and modules tend to follow this strategy.
As for testing, common testing rarely reveals concurrency issues because of their low probability. When testing kernel modules, I would advise using Kernel Strider, which attempts to prove the correctness of concurrent memory accesses, or RaceHound, which increases the probability of concurrency issues and checks for them.
It is always safe to grab a lock for the duration of any code that accesses any shared data, but this is slow since it means only one thread at a time can run significant chunks of code.
Depending on the data in question though, there may be shortcuts that are safe and fast. If it is a simple integer (and by integer I mean the native word size of the CPU, i.e. not a 64-bit value on a 32-bit CPU), then you may not need to do any locking: if one thread tries to write to the integer and the other reads it at the same time, the reader will either get the old value or the new value, never a mix of the two. If the reader doesn't care that it might get the old value, then there is no need for a lock.
If, however, you are updating two integers together and it would be bad for the reader to get the new value for one and the old value for the other, then you need a lock. Another example is if the thread is incrementing the integer. That normally involves a read, an add, and a write. If one thread reads the old value, then the other manages to read, add, and write the new value, and then the first thread adds and writes its new value, both believe they have incremented the variable, but instead of being incremented twice, it was only incremented once. This needs either a lock, or the use of an atomic increment primitive to ensure that the read/modify/write cycle cannot be interrupted. There are also atomic compare-and-swap primitives, so you can read a value, do some math on it, then try to write it back, where the write only succeeds if the variable still holds the original value. That is, if another thread changed it since you read it, the compare-and-swap will fail; you can then discard your new value, start over with a read of the value the other thread set, and try the compare-and-swap again.
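For instance, a small C++ sketch of both options for the increment/update case (the function names are mine): fetch_add as an atomic increment primitive, and a compare_exchange retry loop as the read-compute-write-if-unchanged pattern just described.

#include <atomic>

std::atomic<int> counter{0};

void increment_with_primitive()
{
    counter.fetch_add(1);  // read-add-write as one atomic operation
}

void double_with_cas_loop()
{
    int old_value = counter.load();
    // Try to write back old_value * 2; this succeeds only if counter still holds old_value.
    // On failure, old_value is refreshed with the current value and we retry.
    while (!counter.compare_exchange_weak(old_value, old_value * 2)) {
    }
}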
Pointers are really just integers, so if you set up a data structure then store a pointer to it where another thread can find it, you don't need a lock as long as you set up the structure fully before you store its address in the pointer. Another thread reading the pointer ( it will need to make sure to read the pointer only once, i.e. by storing it in a local variable then using only that to refer to the structure from then on ) will either see the new structure, or the old one, but never an intermediate state. If most threads only read the structure via the pointer, and any that want to write do so either with a lock, or an atomic test-and-set of the pointer, this is sufficient. Any time you want to modify any member of the structure though, you have to copy it to a new one, change the new one, then update the pointer. This is essentially how the kernel's RCU ( read, copy, update ) mechanism works.
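And a rough C++ sketch of that pointer-publication pattern; in real kernel code this would be rcu_assign_pointer / rcu_dereference plus a reclamation scheme, whereas here the old structure is simply leaked to keep the sketch short, and the Config type is made up:

#include <atomic>

struct Config {
    int a;
    int b;
};

std::atomic<Config*> shared_config{new Config{1, 2}};

void writer_update()
{
    Config* fresh = new Config{3, 4};                        // set up the structure fully first...
    shared_config.store(fresh, std::memory_order_release);   // ...then publish the pointer
    // NOTE: the previous Config is leaked here; real code needs RCU / deferred
    // reclamation to know when no reader can still be using it.
}

int reader_use()
{
    // Read the pointer once into a local and use only that from then on.
    Config* snap = shared_config.load(std::memory_order_acquire);
    return snap->a + snap->b;
}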
Ideally, during design you should enumerate all the resources available in your system, the related threads, and the communication and sharing mechanisms. Determining the following for every resource and maintaining a proper checklist whenever a change is made can be of great help:
The duration for which the resource will be busy (Utilization of resource) & type of lock
Amount of tasks queued upon that particular resource (Load) & priority
Type of communication, sharing mechanism related to resource
Error conditions related to resource
If possible, it is better to have a flow diagram depicting the resources, utilization, locks, load, communication/sharing mechanism and errors.
This process can help you in determining the missing scenarios/unknowns, critical sections and also in identification of bottlenecks.
On top of the above process, you may also need certain tools that can help you in testing / further analysis to rule out hidden problems, if any:
Helgrind - a Valgrind tool for detecting synchronisation errors. This can help in identifying data races/synchronization issues due to improper locking, lock ordering that can cause deadlocks, and also improper POSIX thread API usage that can have later impacts. Refer: http://valgrind.org/docs/manual/hg-manual.html
Locksmith - for determining common lock errors that may arise during runtime or that may cause deadlocks.
ThreadSanitizer - for detecting race conditions. It displays all the accesses and the locks involved for those accesses.
Sparse can help to list the locks acquired and released by a function and also to identify issues such as mixing pointers to user address space with pointers to kernel address space.
Lockdep - For debugging of locks
iotop - For determining the current I/O usage by processes or threads on the system by monitoring the I/O usage information output by the kernel.
LTTng - For tracing race conditions and interrupt cascades possible. (A successor to LTT - Combination of kprobes, tracepoint and perf functionalities)
Ftrace - A Linux kernel internal tracer for analysing /debugging latency and performance related issues.
lsof and fuser can be handy in determining the processes holding locks and the kinds of locks.
Profiling can help in determining where exactly the time is being spent by the kernel. This can be done with tools like perf, Oprofile.
strace can intercept/record the system calls that are made by a process and also the signals that are received by the process. It shows the order of events and all the return/resumption paths of the calls.
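As a quick illustration of one of the user-space tools above (ThreadSanitizer), here is a deliberately racy C++ snippet and a typical way to build it with the sanitizer enabled; the file name is made up:

// racy.cpp - two threads increment a plain int without synchronization
#include <thread>

int counter = 0;  // shared and unprotected: a data race

int main()
{
    std::thread t1([] { for (int i = 0; i < 100000; ++i) ++counter; });
    std::thread t2([] { for (int i = 0; i < 100000; ++i) ++counter; });
    t1.join();
    t2.join();
}

// Build with ThreadSanitizer (gcc or clang) and run; it reports the data race:
//   g++ -fsanitize=thread -g -O1 racy.cpp -o racy && ./racy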

How locking is implemented?

I have the following code:
while (lock)
    ;
lock = 1;
// critical section
lock = 0;
Reading or changing the lock value is in itself a multi-instruction sequence:
read lock
change value
write it
If it happens like this:
1) One thread reads the lock and stops there
2) Another thread reads it, sees it is free, locks it, and gets halfway through its work
3) The first thread wakes up and goes into the CS
So how would locking be implemented in the system?
Placing a guard variable on top of another variable is not right: it would be like guarding the guard?
Stopping the other processors/threads is also not right?
It is 100% platform specific. Generally, the CPU provides some form of atomic operation such as exchange or compare and swap. A typical lock might work like this:
1) Create: Store 0 (unlocked) in the variable.
2) Lock: Atomically attempt to switch the value of the variable from 0 (unlocked) to 1 (locked). If we failed (because it wasn't unlocked to begin with), let the CPU rest a bit, and then retry. Use a memory barrier to ensure no future memory operations sneak behind this one.
3) Unlock: Use a memory barrier to ensure previous memory operations don't sneak past this one. Atomically write 0 (unlocked) to the variable.
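A minimal C++ sketch of those three steps, using std::atomic; this is only an illustration of the idea, not a production lock (a real implementation would add backoff, futex-style waiting, etc.):

#include <atomic>
#include <thread>

std::atomic<int> the_lock{0};   // step 1 (Create): 0 means unlocked

void lock()
{
    // step 2 (Lock): atomically swap in 1; if the old value was already 1,
    // somebody else holds the lock, so rest a bit and retry.
    while (the_lock.exchange(1, std::memory_order_acquire) == 1) {
        std::this_thread::yield();
    }
    // memory_order_acquire is the barrier that keeps later memory operations
    // from sneaking ahead of the lock acquisition.
}

void unlock()
{
    // step 3 (Unlock): memory_order_release keeps earlier memory operations
    // from sneaking past this store; then write 0 (unlocked).
    the_lock.store(0, std::memory_order_release);
}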
Note that you really don't need to understand this unless you want to design your own synchronization primitives. And if you want to do that, you need to understand an awful lot more. It's certainly a good idea for every programmer to have a general idea of what he's making the hardware do. But this is an area filled with seriously heavy wizardry. There are so many, many ways this can go horribly wrong. So just use the locking primitives provided by the geniuses who made your platform, compiler, and threading library. Here be dragons.
For example, SMP Pentium Pro systems have an erratum that requires special handling in the unlock operation. A naive implementation of the lock algorithm will cause the branch prediction logic to expect the operation to keep spinning, incurring a massive performance penalty at the worst possible time -- when you first acquire the lock. A naive implementation of the lock algorithm may cause two cores each waiting for the same lock to saturate the bus, slowing the CPU that needs to get work done in order to release the lock to a crawl. These all require heavy wizardry and deep understanding of the hardware to deal with.
In a course I studied at uni, a possible firmware solution for implementing locks was presented in the form of an "atomicity bit" associated with each memory operation initiated by a processor.
Basically, when locking, you'll notice that you have a sequence of operations that needs to be executed atomically: test the value of the flag and, if it is not set, set it to locked, otherwise try again. This sequence can be made atomic by associating a bit with each memory request sent by the CPU. The first N-1 operations will have the bit set, while the last one will have it unset, to mark the end of the atomic sequence.
When the memory module (there can be several modules) where the flag data is stored receives the request for the first operation in the sequence (whose bit is set), it will serve it and not take requests from any other CPU until the CPU that initiated the atomic sequence sends a request with an unset atomicity bit (since these transactions are usually short, a coarse-grained approach like this is acceptable). Note that this is usually made easier by the assembler providing specialized instructions of the "compare-and-set" type, which do exactly what I mentioned above.
