What happens when multiple threads try to access a critical section exactly at the same time?

I've been trying to find an answer for that, and all I could find is that once a thread reaches a critical section it locks it against other threads (or some other lock mechanism is used to lock the critical section).
But that implies that the threads didn't really reach the CS exactly at the same microsecond.
Although I guess it is quite rare, can it really happen, and what happens in this situation?
Can I simply assume that the program will malfunction?
Note: I am referring to multicore CPUs.

I think you are missing the point of the fundamental locking primitives like semaphores. If the correct primitive is used, and used correctly, then the timing of the threads does not matter. They may well be simultaneous. The operating system guarantees that no two threads will be inside the critical section at the same time. Even on multicore machines, this part is specially implemented (with lots of trickery, even) to give that assurance.
To address your concerns specifically:
But that implies that the threads didn't really reach the CS exactly at the same microsecond.
No. The other threads could have reached it in the same microsecond, but if the locking mechanism is correct, then only one of the competing threads will "enter" the critical section and the others will wait.
Although I guess it is quite rare, can it really happen, and what happens in this situation?
Rare or not, if the correct locking primitive is used, and used correctly, then no two threads will enter the critical section.
Can I simply assume that the program will malfunction?
Ideally the program should not malfunction. But any code will have bugs - so does your code and the Operating System code for the Semaphores. So it is safe to assume that in some edge cases the program will indeed malfunction. But this assumption is true for any code in general.
Locking and critical sections are rather tricky to implement correctly. So for non-academic purposes we should always use the system-provided locking primitives. All operating systems expose primitives like semaphores, which most programming languages have ways to use. Some programming languages have their own lightweight implementations, which provide somewhat softer guarantees but higher performance. As I said, when working with critical sections, it is critical to choose the correct primitive and also to use it correctly.
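To make the "use the provided primitive" advice concrete, here is a minimal C++ sketch. The function name count_with_mutex and its parameters are made up for illustration; std::mutex and std::lock_guard are the standard library's primitives. However closely the threads arrive, only one of them is inside the guarded block at a time.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: start `threads` threads that each add 1 to a shared
// counter `iterations` times, then return the final value.
long count_with_mutex(int threads, int iterations) {
    assert(threads >= 0 && iterations >= 0);
    long counter = 0;
    std::mutex m;  // library-provided locking primitive

    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<std::mutex> guard(m);  // enter the critical section
                ++counter;                             // protected update
            }                                          // guard unlocks at end of scope
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

With the lock in place, count_with_mutex(4, 10000) should always return 40000; removing the lock_guard line turns the increment into a data race.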

...But that implies that the threads didn't really reach the CS exactly at the same microsecond.
Short answer: memory system hardware makes it impossible for two different processors to access the same memory location at the same time. I'm not a computer architect, so I can't explain how it works, but the memory system serializes all of the accesses to the shared main memory by the various CPUs in a multi-CPU system.
"Entering a critical section" means locking a mutex, and a mutex is basically just a flag in shared memory that is accessed by a specific protocol.

It is the task of the cache coherence protocol to make sure there are no 2 writes on the same chunk of memory (cache line) at the same time. With MESI there can be multiple readers of the same cacheline, but only 1 writer.
So if 2 threads at the same time want to write to the same cacheline, their requests will be serialized by cache coherence protocol.
Most CPU architectures support atomic operations like CAS (compare-and-swap). On x86 this can be done using the lock prefix. The CPU locks the cache line when it starts the CAS instruction and will not respond to cache coherence requests from other cores until it has finished the atomic operation.
So if you would have 2 CPUs that both want to do a CAS, these operations are serialized by the underlying hardware.
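The serialization described above can be observed from user code. The following is a rough C++ sketch (count_cas_winners is a made-up name, and -1 as the "unclaimed" marker is an arbitrary choice): several threads race to CAS the same variable, and because the hardware orders the operations, exactly one of them sees the original value.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: `threads` threads race to claim a slot with CAS.
// Returns how many of them succeeded.
int count_cas_winners(int threads) {
    assert(threads >= 0);
    std::atomic<int> owner{-1};  // -1 means "unclaimed" (arbitrary marker)
    std::atomic<int> winners{0};

    std::vector<std::thread> pool;
    for (int id = 0; id < threads; ++id) {
        pool.emplace_back([&owner, &winners, id] {
            int expected = -1;
            // Succeeds only if owner is still -1 at the instant of the CAS.
            if (owner.compare_exchange_strong(expected, id)) {
                winners.fetch_add(1);
            }
        });
    }
    for (auto& th : pool) th.join();
    return winners.load();
}
```

No matter how the threads are timed, the result for any positive thread count is 1.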


What is the difference between an atomic operation and a critical section, and which of the two prevents context switching?

A programming language or the processor already has "default" atomic operations and we can use them as far as I understand.
What is the difference between an atomic operation and critical section?
Atomic operations are instructions that guarantee atomic accesses/updates of shared (small) variables. These generally include operations like increment, decrement, addition, subtraction, compare-and-swap (aka CAS), exchange, logical operations (and, or, xor) as well as basic loads/stores. If you want to perform a non-trivial operation that is not supported by the target platform (or one involving large variables), then you cannot use a single atomic operation. This means that either several of them are required or another mechanism should be used instead (e.g. a critical section or transactional memory). Note that using multiple atomic operations often makes things significantly more complex (see the ABA problem). On mainstream CPUs, atomic operations are generally implemented by locking the cache line that holds the variable so that only one core can access it at a time.
Critical sections are meant to protect one or multiple instructions from being executed by multiple threads at the same time. They are generally protected using a system mutex. The thread entering the critical section locks the associated mutex and unlocks it when leaving the section. System mutexes cause a thread entering a critical section to wait if the associated mutex is already locked. This is generally done using a context switch (the thread is descheduled and rescheduled later).
Critical sections can be efficient when the lock is only rarely already taken by another thread; otherwise, context switches can significantly impact performance. Atomic operations are not great either when many threads perform them on the same variable: contention can make atomic accesses significantly slower (e.g. in spin locks). This is especially true for atomic CAS operations. Some platforms can execute atomic operations very quickly (e.g. GPUs) since they have dedicated units for executing them efficiently.
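As an illustration of an operation built from several atomic steps, here is a sketch of an "atomic maximum" written as a CAS retry loop. The names atomic_store_max and parallel_max are invented for this example; whether a given platform offers a single native instruction for this is a separate question.

```cpp
#include <atomic>
#include <cassert>
#include <limits>
#include <thread>
#include <vector>

// Hypothetical helper: atomically raise `target` to at least `value`.
void atomic_store_max(std::atomic<int>& target, int value) {
    int seen = target.load();
    // Retry until the stored value is already >= value or our CAS wins.
    // On failure, compare_exchange_weak refreshes `seen` with the current value.
    while (seen < value && !target.compare_exchange_weak(seen, value)) {
    }
}

// Hypothetical driver: one thread per input value, all updating one variable.
int parallel_max(const std::vector<int>& values) {
    assert(!values.empty());
    std::atomic<int> best{std::numeric_limits<int>::min()};
    std::vector<std::thread> pool;
    for (int v : values) {
        pool.emplace_back([&best, v] { atomic_store_max(best, v); });
    }
    for (auto& th : pool) th.join();
    return best.load();
}
```

The retry loop is where contention costs show up: the more threads hit the same variable, the more often a CAS fails and has to be repeated.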
which of the two prevents context switching?
Neither of the two prevents context switching. Modern operating systems can perform a context switch at any time. That being said, critical sections generally cause context switches: a thread trying to enter a critical section already locked by another thread will typically be put to sleep and be woken up by the OS scheduler when the other thread unlocks the section. Atomic operations do not impact the scheduling of the system (at least not on mainstream platforms).
Note that the above text is also true for processes.
Speaking only to the nomenclature question:
"Atomic" means "cannot be broken down into smaller parts." In programming, an operation performed by one thread is "atomic" (as seen from other threads) if there is no possible way for the other threads to see the operation in a half-way done state. From the point of view of other threads, it's as if the entire operation happened in a single instant. It either has already happened, or it hasn't happened yet. There is no in between.
As Jérôme Richard points out, modern computer hardware provides atomic operations on simple variables. We can use those to make more complex operations seem "atomic" from the point of view of other threads either by using the hardware atomics in tricky non-blocking algorithms, or by using the hardware atomics in the implementation of mutex locks.
"Critical section" comes from a time before multi-threading. In operating system kernel code, and in "bare metal" application code, there has always been a limited form of concurrency between the main body of code and the interrupt handlers. "Critical section," back in the day, referred to a routine in the main body of code that was protected from interference by the interrupt handlers by executing it with interrupts disabled.
Systems programmers today still use "critical section" with the original meaning, but now we also sometimes say it to talk about a routine that is executed by a thread while the thread has a mutex locked.
IMO, "critical section" encourages a somewhat less useful way of thinking about mutex locks though because it's never the code that needs protection from interference. It's always about protecting the integrity of shared data. Sometimes a programmer who worries about defining The critical section can lose sight of the fact that there may be multiple routines in the program that all access the same shared data.
IMO, this is one place where an object-oriented style of programming shines, because it's easier to keep track of what needs to be protected if it is encapsulated in private members of some object and can only be accessed through the object's thread-safe public methods.
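A small sketch of that idea in C++ (the Account class and its methods are purely illustrative): the shared data is private, and every public method locks the same mutex, so there is no single "critical section" to hunt for, only data that cannot be reached without the lock.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical example class: the shared data is private, and every public
// method takes the same mutex before touching it.
class Account {
public:
    explicit Account(long initial) : balance_(initial) {}

    void deposit(long amount) {
        assert(amount >= 0);
        std::lock_guard<std::mutex> guard(m_);
        balance_ += amount;
    }

    // Returns false and leaves the balance unchanged if funds are insufficient.
    bool withdraw(long amount) {
        std::lock_guard<std::mutex> guard(m_);
        if (balance_ < amount) return false;
        balance_ -= amount;
        return true;
    }

    long balance() const {
        std::lock_guard<std::mutex> guard(m_);
        return balance_;
    }

private:
    mutable std::mutex m_;  // mutable so that const methods can lock it
    long balance_;
};
```

Note that withdraw does its check and its update under one lock acquisition; splitting them into two separately locked calls would reopen the race.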

Guaranteed CPU cache update after a certain time

Let's say I have a variable var located somewhere in memory and that an arbitrary number of processors/threads could read and modify it at any given time. But it's guaranteed that at least n seconds will have elapsed between a processor modifying var and any other one reading var. Is there a value of n that guarantees that the processor reading var will read the updated value?
If your concern really is cache coherence, you should generally be safe [1].
Specifically, however, you may not be.
Cache coherence is usually handled by the hardware [2] without the help of the software.
However, this is very implementation-specific: NUMA may be non-cache-coherent, a compute shader may need specific built-in functions, while IA32e and ARM generally hide cache coherence from the programmer.
To answer your question directly: no, you have no guarantees whatsoever.
The point is that cache coherence is something you deal with in clustered and parallel non-uniform architectures.
While in these situations the programming model is inherently multi-threaded, the two concepts [3] are separate, and what should really concern you is how to properly handle multi-threading, specifically synchronization and memory ordering.
Your question seems to suggest a simple case, where the readers are executed long after the writer is done.
If this property is really enforced, you don't need any synchronization or memory barrier. Beware, however, that sleep functions don't qualify as valid enforcement.
If you instead need to synchronize (and so to order the memory accesses), then you need to use language-specific constructs, for example volatile in C# and Java, atomics in C and C++, or specific instructions in assembly.
You may need to implement Critical sections too.
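For the C++ case, a minimal sketch of "order the accesses with atomics instead of waiting n seconds" could look like this. The function publish_and_read and its variable names are made up; the release/acquire pairing on the flag is what provides the guarantee, not elapsed time.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: a writer publishes `value` and a reader waits for it.
int publish_and_read(int value) {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag that orders the accesses

    std::thread writer([&] {
        payload = value;                               // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });

    int seen = 0;
    std::thread reader([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();  // wait for the flag, not for a timer
        }
        seen = payload;  // ordered after the writer's store to payload
    });

    writer.join();
    reader.join();
    assert(seen == value);  // holds because of the release/acquire pairing
    return seen;
}
```

Replacing the flag with a sleep in the reader would give no such ordering in the C++ memory model, however long the sleep is.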
If you actually need to manually control the cache coherence for your architecture, then you have to check the relevant specifications (usually datasheets and formal papers), because there is no uniform way to deal with it; the compiler should provide some intrinsic or the runtime should provide a library.
So to add something to the direct answer above: no, you have no guarantees whatsoever, but when a usual CPU, in a usual architecture, needs that data, it will be able to use the most up-to-date value anyway. So you don't need to worry about that aspect.
Please note the use of the words common and that
[1] For example, if you use an Intel/AMD/ARM CPU, don't even think about cache coherence.
[2] Either the CPU itself, a local monitor, a system monitor or a specific device.
[3] Multi-threading and cache coherence.
The cache will tend to get flushed on operating system tick interrupts when it goes into the scheduler to see if there's a different task to run.
However, as operating systems get smarter with things such as tickless NoHz and as CPU core counts go up, this gets less and less likely, and you shouldn't count on it.
Supercomputer clusters may not task switch for minutes at a time because they're using customized operating system code that doesn't interrupt the running jobs, ever. Compute jobs are assigned to a core from 1-7 with no interrupts and all of the other work runs on core-0.
There are two concepts mixed in your question: software synchronization and hardware coherency. Hardware coherency has already been covered by Margaret, so I won't cover it here.
Software Synchronization
x86 guarantees that a quadword access is carried out atomically if it is aligned on a 64-bit boundary. But this only guarantees that another processor won't read a partial result (e.g. a weird [32-bit new][32-bit old] mixture). It does not guarantee a hard deadline before which another processor will see the newly assigned value. Having the other thread wait for some time is not an elegant solution either, because the two threads would first need a synchronized starting time. So, if you need such a guarantee, you need a condition variable to make the other thread wait.
In short, use a condition variable if you need a sequencing effect, and use locks, transactional memory, etc. to protect variables that are longer than a quadword or not 64-bit aligned.
Btw, here is some useful material on cache coherency if you are interested.
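A sketch of the condition-variable approach in C++ (wait_for_update and the variable names are hypothetical): the reader does not guess how long to wait, it blocks until the writer signals that var has been updated.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper: the reader blocks until the writer has stored
// `new_value`, then returns what the reader observed.
int wait_for_update(int new_value) {
    int var = 0;
    bool updated = false;
    std::mutex m;
    std::condition_variable cv;

    int seen = 0;
    std::thread reader([&] {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return updated; });  // sleeps until signalled
        assert(updated);  // the predicate holds when wait returns
        seen = var;
    });

    {
        std::lock_guard<std::mutex> guard(m);
        var = new_value;
        updated = true;
    }
    cv.notify_one();  // wake the reader

    reader.join();
    return seen;
}
```

The predicate passed to wait also covers the case where the writer finishes before the reader even starts waiting.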

Why cannot a Lock for `2`-threads be implemented using only `1` shared variable satisfying mutual exclusion and deadlock freedom?

I've been working a lot with concurrency at the practical level, and therefore I've also started to study it theoretically to gain insight into this field of computer science.
However, I have trouble understanding the following:
Why cannot a Lock for 2-threads be implemented using only 1 shared variable satisfying mutual exclusion and deadlock freedom?
More generally, why are at least n shared variables needed for an n-thread lock satisfying mutual exclusion and deadlock freedom?
Consider two threads A and B. I see that A must write to this variable in order to signify that it acquires the lock. The variable could be a boolean. Is it because A needs to read the variable before writing it, and these are two operations (not done atomically)?
Most likely, you're reading things that make assumptions about the platform's capabilities that are no longer realistic. You're probably considering the case where a CPU has no prefetching, no posted writes, total read and store ordering, no compiler optimizations that affect memory visibility or memory operation ordering, and no risk of word tearing, but does not have an atomic "read-modify-write" operation like increment or compare-exchange. With these assumptions, there's really no way to do it with one variable.
This is an interesting theoretical problem, but it has very little practical relevance. Modern CPUs do have all of those optimizations: they prefetch reads, they post writes to buffers, they re-order reads and stores, and compilers optimize away memory operations. Word tearing is typically not an issue for aligned operations on native integer types. But, more importantly, modern CPUs have sophisticated, high-performance atomic operations such as increment, decrement, compare-exchange, and so on.
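To show what changes once an atomic read-modify-write is available, here is an illustrative (not production-quality) two-thread lock built on a single shared variable. OneVariableLock and count_with_one_variable_lock are invented names; the key point is that exchange reads the old value and writes the new one as one indivisible step, which is exactly what the theoretical model rules out.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative only: a lock that uses ONE shared variable, relying on an
// atomic read-modify-write (exchange) provided by the platform.
class OneVariableLock {
public:
    void lock() {
        // exchange atomically stores true and returns the previous value;
        // spin while the previous value was already true (lock held).
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        assert(locked_.load());  // must only be called while holding the lock
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};

// Hypothetical driver: two threads increment one counter under the lock.
long count_with_one_variable_lock(int iterations) {
    OneVariableLock lock;
    long counter = 0;
    auto work = [&] {
        for (int i = 0; i < iterations; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return counter;
}
```

This sketch makes no attempt at fairness or at the hyper-threading and branch-prediction concerns that real spinlocks have to deal with.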
When you write synchronization primitives, the exercise is highly platform-specific. The combination of capabilities available to you varies from platform to platform. Even more importantly, their costs vary drastically from platform to platform, so even if many solutions are possible, they may not be equally good.
Lastly, you have to have a deep understanding of what each primitive actually makes the platform do. For example, on modern Intel CPUs, there is hyper-threading. It's important that, for example, a thread waiting for a spinlock doesn't starve another thread sharing the physical core. That requires deep understanding of how hyper-threading actually works. Similarly, it's easy to code a spinlock so that you take the mother of all mispredicted branches when you acquire the lock and blow out the pipelines at the instant where performance is the most critical. You need to understand how branch prediction works and how it interacts with instruction pipelining to avoid this issue.
The vast majority of programmers should never, ever write synchronization primitives and use them in actual, real world code. Getting them to work with assured reliability is hard, and getting them to perform properly is much, much harder. And to top it off, it's not possible to measure their performance easily. (Of course, it's great to experiment, so long as you don't get an exaggerated sense of the usefulness of your experimental code.)

Critical sections with multicore processors

With a single-core processor, where all your threads are run from the one single CPU, the idea of implementing a critical section using an atomic test-and-set operation on some mutex (or semaphore or etc) in memory seems straightforward enough; because your processor is executing a test-and-set from one spot in your program, it necessarily can't be doing one from another spot in your program disguised as some other thread.
But what happens when you do actually have more than one physical processor? It seems that simple instruction-level atomicity wouldn't be sufficient, because with two processors potentially executing their test-and-set operations at the same time, what you really need to maintain atomicity on is access to the shared memory location of the mutex. (And if the shared memory location is loaded into cache, there's the whole cache consistency thing to deal with, too.)
This seems like it would incur far more overhead than the single core case, so here's the meat of the question: How much worse is it? Is it worse? Do we just live with it? Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
Multi-core/SMP systems are not just several CPUs glued together. There's explicit support for doing things in parallel. All the synchronization primitives are implemented with the help of hardware along the lines of atomic CAS. The instruction either locks the bus shared by the CPUs and the memory controller (and devices that do DMA) and updates the memory, or just updates the memory relying on cache snooping. This in turn causes the cache coherency algorithm to kick in, forcing all involved parties to flush their caches.
Disclaimer: this is a very basic description; there are more interesting things here, like virtual vs. physical caches, cache write-back policies, memory models, fences, etc.
If you want to know more about how OS might use these hardware facilities - here's an excellent book on the subject.
The vendor of multi-core cpus has to take care that the different cores coordinate themselves when executing instructions which guarantee atomic memory access.
On Intel chips, for instance, you have the 'cmpxchg' instruction. It compares the value stored at a memory location to an expected value and exchanges it for the new value if the two match. If you precede it with the 'lock' prefix, it is guaranteed to be atomic with respect to all cores.
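The compare-and-exchange semantics described above are exposed in C++ as std::atomic's compare_exchange operations (which compilers for x86 typically lower to a lock-prefixed cmpxchg). The wrapper try_swap and the demo function below are hypothetical names used only to show the behaviour.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper mirroring the semantics described above:
// store `desired` only if the current value equals `expected`.
bool try_swap(std::atomic<int>& location, int expected, int desired) {
    // On failure, compare_exchange_strong writes the actual value into
    // `expected` (a local copy here, so the caller's argument is unaffected).
    return location.compare_exchange_strong(expected, desired);
}

// Small self-check of the behaviour.
void try_swap_demo() {
    std::atomic<int> v{5};
    assert(try_swap(v, 5, 9));   // matched: 5 is replaced by 9
    assert(!try_swap(v, 5, 1));  // no match: the value stays 9
    assert(v.load() == 9);
}
```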
You would need a test-and-set that forces the processor to notify all the other cores of the operation so that they are aware. Yes, that introduces an overhead and you have to live with it. It's a reason to design multithreaded applications in such a way that they don't wait for synchronization primitives too often.
Or sidestep it by enforcing a policy that all threads within a process group have to live on the same physical core?
That would cancel the whole point of multithreading. When you are using a lock, semaphore, or other synchronization technique, you are relying on the OS to make sure that these operations are interlocked, no matter how many cores you are using.
The time to switch to a different thread after a lock has been released is mostly determined by the cost of a context switch. This SO thread deals with the context switching overhead, so you might want to check that.
Memory accesses are handled by the memory controller which should take care of multi-core issues, i.e. it shouldn't allow simultaneous access to same addresses (probably handled either by memory page or memory line basis). So you can use a flag to indicate whether another processor is updating the memory contents of some block (this to avoid a type of dirty read where part of the record is updated, but not all).
A more elegant solution is to use a HW semaphore block if the processor has such a feature. A HW semaphore is a simple queue which could be of size no_of_cores -1. This is how it is in TI's 6487/8 processor. You can either query the semaphore directly (and loop until it is released) or do an indirect query which will result in an interrupt once your core gets the resource. The requests are queued and served in the order they were made. A semaphore query is an atomic operation.
Cache consistency is another issue and you might need to do cache writebacks and refreshes in some cases. But this is a very cache implementation specific thing. With 6487/8 we needed to do that on a few operations.
Well, depending on what type of computers you have lying around the house, do the following: write a simple multithreaded application. Run this application on a single core (Pentium 4 or Core Solo) and then run it on a multicore processor (Core 2 Duo or similar) and see how big the speed-up is.
Granted these are unfair comparisons since Pentium 4 and Core Solo are much slower regardless of cores than a Core 2 Duo. Maybe compare between a Core 2 Duo and a Core 2 Quad with an application that can use 4 or more threads.
You raise a number of valid points. Multiple processors introduce a lot of headache and overhead. However, we just have to live with them, because the speed boost of parallelism can far outweigh them, if the critical sections are made long enough.
As for your final suggestion about having all threads on the same physical core, that completely defeats the point of a multi-core computer!

Why lock may become a bottleneck of multithreaded program?

Why lock may become a bottleneck of multithreaded program?
If my queue is frequently accessed with pop() and push() by multiple threads, which lock should I use?
The lock you use depends on your platform but will generally be some flavour of mutex. On Windows, you would use a critical section, and in .NET, you'd use a monitor. I'm not very familiar with locking mechanisms on other platforms. I'd stay away from lock-free approaches. They are very difficult to program correctly and the performance gains are often not as great as you would expect.
Locks become a bottleneck in your program when they are under heavy contention. That is, a very large number of threads all try to acquire the lock at the same time. This wastes a lot of CPU cycles as threads become blocked and the OS spends a greater and greater portion of its time switching between threads. This sort of problem most frequently manifests itself in the server world. For desktop applications, it's rare that locks will cause a performance issue.
"Why lock may become a bottleneck of multithreaded program?" - think of a turnstile (also called a baffle gate), which only lets one person through at a time, with a crowd of people waiting to go through it.
For a queue, use the simplest lock your environment has to offer.
For a queue, it is easy to write a lock-free implementation (google away)
Locks are bottlenecks because they force all other threads which encounter them to stop doing what they're doing and wait for the lock to open, thus wasting time. One of the ideas behind multithreading is to use as many processors as possible at any given time. By forcing threads to wait on the locks the application essentially gives up processing power which it might have used.
"Why lock may become a bottleneck of multithreaded program?"
Because waiting threads remain blocked until shared memory is unlocked.
Suggest you read this article on "Concurrency: What Every Dev Must Know About Multithreaded Apps" http://msdn.microsoft.com/en-au/magazine/cc163744.aspx
Locks are expensive both because they require operating system calls in the middle of your algorithm and because they are hard to do properly when creating the CPU.
As a programmer, it is best to leave the locks in the middle of your data structures to the experts and instead use a good multithreaded library such as Intel's TBB.
For queues, you would want to use atomic instructions (hard) or a spinlock (easier) if possible, because they are cheap compared to a mutex. Use a mutex if you are doing a lot of work that needs to be locked, e.g. modifying a complex tree structure.
In the threading packages that I'm familiar with, your options for mutexes are recursive and non-recursive. You should opt for non-recursive -- all of your accesses will be lock(); queue_op(); unlock(), so there's no need to be able to acquire the lock twice.
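The lock(); queue_op(); unlock() pattern can be sketched in C++ roughly as follows. LockedQueue and push_then_drain are made-up names; std::mutex is used because it is non-recursive, and std::lock_guard performs the lock and unlock steps around each queue operation.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical queue wrapper: every operation is lock(); queue_op(); unlock().
template <typename T>
class LockedQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> guard(m_);  // lock()
        items_.push_back(std::move(value));     // queue_op()
    }                                           // unlock() at end of scope

    // Returns false if the queue was empty; otherwise moves the front into `out`.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> guard(m_);
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

private:
    std::mutex m_;  // std::mutex is non-recursive
    std::deque<T> items_;
};

// Hypothetical driver: `producers` threads each push 1..n, then the caller
// drains the queue and returns the sum of everything popped.
long push_then_drain(int producers, int n) {
    assert(producers >= 0 && n >= 0);
    LockedQueue<int> q;
    std::vector<std::thread> pool;
    for (int p = 0; p < producers; ++p) {
        pool.emplace_back([&q, n] {
            for (int i = 1; i <= n; ++i) q.push(i);
        });
    }
    for (auto& th : pool) th.join();

    long sum = 0;
    int item = 0;
    while (q.try_pop(item)) sum += item;
    return sum;
}
```

Combining the emptiness check and the pop in one try_pop call matters: separate empty() and pop() calls would each be locked but not atomic together.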
