rcu_dereference() vs rcu_dereference_protected()? - linux

Can anyone explain the difference between rcu_dereference() and rcu_dereference_protected()?
rcu_dereference() contains barrier code, while rcu_dereference_protected() does not.
When should rcu_dereference() be used, and when rcu_dereference_protected()?

In short:
rcu_dereference() should be used on the read side, protected by rcu_read_lock() or similar.
rcu_dereference_protected() should be used on the write side (update side), either by a single writer or under the lock that prevents several writers from concurrently modifying the dereferenced pointer. In such cases the pointer cannot be modified outside the current thread, so neither compiler nor CPU barriers are needed.
If in doubt, using rcu_dereference() is always safe, and its performance penalty (compared to rcu_dereference_protected()) is low.
The exact description of rcu_dereference_protected() in kernel 4.6:
/**
* rcu_dereference_protected() - fetch RCU pointer when updates prevented
* @p: The pointer to read, prior to dereferencing
* @c: The conditions under which the dereference will take place
*
* Return the value of the specified RCU-protected pointer, but omit
* both the smp_read_barrier_depends() and the READ_ONCE(). This
* is useful in cases where update-side locks prevent the value of the
* pointer from changing. Please note that this primitive does -not-
* prevent the compiler from repeating this reference or combining it
* with other references, so it should not be used without protection
* of appropriate locks.
*
* This function is only for update-side use. Using this function
* when protected only by rcu_read_lock() will result in infrequent
* but very ugly failures.
*/
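To make the split concrete, here is a minimal usage sketch; the structure, variable, and lock names are invented for illustration, and error handling and NULL checks are omitted:

#include <linux/rcupdate.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

struct foo {
    int data;
};

static struct foo __rcu *gbl_foo;       /* RCU-protected pointer (hypothetical) */
static DEFINE_SPINLOCK(gbl_foo_lock);   /* serializes updaters */

/* Read side: protected only by rcu_read_lock(), so use rcu_dereference(). */
int read_foo_data(void)
{
    int val;

    rcu_read_lock();
    val = rcu_dereference(gbl_foo)->data;
    rcu_read_unlock();
    return val;
}

/* Update side: gbl_foo_lock prevents concurrent updates, so
 * rcu_dereference_protected() may legitimately skip the read-side barriers. */
void update_foo(struct foo *newp)
{
    struct foo *old;

    spin_lock(&gbl_foo_lock);
    old = rcu_dereference_protected(gbl_foo,
                                    lockdep_is_held(&gbl_foo_lock));
    rcu_assign_pointer(gbl_foo, newp);
    spin_unlock(&gbl_foo_lock);

    synchronize_rcu();   /* wait for readers that may still hold "old" */
    kfree(old);
}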

Related

Memory barrier in the implementation of single producer single consumer

The following implementation from Wikipedia:
volatile unsigned int produceCount = 0, consumeCount = 0;
TokenType buffer[BUFFER_SIZE];

void producer(void) {
    while (1) {
        while (produceCount - consumeCount == BUFFER_SIZE)
            sched_yield(); // buffer is full

        buffer[produceCount % BUFFER_SIZE] = produceToken();
        // a memory_barrier should go here, see the explanation above
        ++produceCount;
    }
}

void consumer(void) {
    while (1) {
        while (produceCount - consumeCount == 0)
            sched_yield(); // buffer is empty

        consumeToken(buffer[consumeCount % BUFFER_SIZE]);
        // a memory_barrier should go here, the explanation above still applies
        ++consumeCount;
    }
}
says that a memory barrier must be used between the line that accesses the buffer and the line that updates the Count variable.
This is done to prevent the CPU from reordering the instructions above the fence with those below it. The Count variable shouldn't be incremented before it is used to index into the buffer.
If a fence is not used, won't this kind of reordering violate the correctness of code? The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Thanks
If a fence is not used, won't this kind of reordering violate the correctness of code? The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Good question.
In C++, unless some form of memory barrier is used (atomic, mutex, etc.), the compiler assumes that the code is single-threaded. In that case, the as-if rule says that the compiler may emit whatever code it likes, provided that the overall observable effect is 'as if' your code was executed sequentially.
As mentioned in the comments, volatile does not necessarily alter this, being merely an implementation-defined hint that the variable may change between accesses (this is not the same as being modified by another thread).
So if you write multi-threaded code without memory barriers, you get no guarantees that changes to a variable in one thread will even be observed by another thread, because as far as the compiler is concerned that other thread should not be touching the same memory, ever.
What you will actually observe is undefined behaviour.
It seems that your question is "can incrementing Count and the assignment to buffer be reordered without changing the code's behavior?".
Consider the following code transformation:
int count1 = produceCount++;
buffer[count1 % BUFFER_SIZE] = produceToken();
Notice that this code behaves exactly like the original: one read from a volatile variable, one write to a volatile, the read happens before the write, and the state of the program is the same. However, other threads will see a different picture regarding the order of the produceCount increment and the buffer modification.
Both the compiler and the CPU can perform that transformation in the absence of memory fences, so you need to force those two operations into the correct order.
If a fence is not used, won't this kind of reordering violate the correctness of code?
Nope. Can you construct any portable code that can tell the difference?
The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Why shouldn't it? What would the payoff be for the costs incurred? Things like write combining and speculative fetching are huge optimizations and disabling them is a non-starter.
If you're thinking that volatile alone should do it, that's simply not true. The volatile keyword has no defined thread synchronization semantics in C or C++. It might happen to work on some platforms and it might happen not to work on others. In Java, volatile does have defined thread synchronization semantics, but they don't include providing ordering for accesses to non-volatiles.
However, memory barriers do have well-defined thread synchronization semantics. We need to make sure that no thread can see the indication that data is available before it can see the data itself. And we need to make sure that the indication that data may be overwritten is not visible to other threads before the marking thread is finished with that data.
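To illustrate that pairing, here is a rough sketch of the same queue written with C11 atomics instead of volatile; the release store on each counter pairs with the acquire load in the other thread. produceToken() and consumeToken() are assumed to exist elsewhere, and BUFFER_SIZE should divide the counter's wrap-around range (e.g. be a power of two):

#include <stdatomic.h>
#include <sched.h>

#define BUFFER_SIZE 256                /* a power of two keeps the wrapping counters consistent */
typedef int TokenType;                 /* stand-in for the real token type */

static atomic_uint produceCount, consumeCount;
static TokenType buffer[BUFFER_SIZE];

TokenType produceToken(void);          /* assumed to exist elsewhere */
void consumeToken(TokenType t);        /* assumed to exist elsewhere */

void producer(void)
{
    for (;;) {
        unsigned int head = atomic_load_explicit(&produceCount, memory_order_relaxed);

        /* wait while the buffer is full; the acquire load pairs with the
         * consumer's release store, so the slot is really free to reuse */
        while (head - atomic_load_explicit(&consumeCount, memory_order_acquire) == BUFFER_SIZE)
            sched_yield();

        buffer[head % BUFFER_SIZE] = produceToken();

        /* release: the token is written before the new count becomes visible */
        atomic_store_explicit(&produceCount, head + 1, memory_order_release);
    }
}

void consumer(void)
{
    for (;;) {
        unsigned int tail = atomic_load_explicit(&consumeCount, memory_order_relaxed);

        /* wait while the buffer is empty; the acquire load pairs with the
         * producer's release store, so the token is fully visible */
        while (atomic_load_explicit(&produceCount, memory_order_acquire) - tail == 0)
            sched_yield();

        consumeToken(buffer[tail % BUFFER_SIZE]);

        /* release: the slot is read before it is marked reusable */
        atomic_store_explicit(&consumeCount, tail + 1, memory_order_release);
    }
}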

What's the functionality of the function pm_runtime_put_sync()?

The function pm_runtime_put_sync() is called in spi-omap2-mcspi.c
Can somebody please explain what this function call actually does?
Thank you!
It calls __pm_runtime_idle(dev, RPM_GET_PUT) internally, which is documented as:
int __pm_runtime_idle(struct device *dev, int rpmflags)

/*
 * Entry point for runtime idle operations.
 * @dev: Device to send idle notification for.
 * @rpmflags: Flag bits.
 *
 * If the RPM_GET_PUT flag is set, decrement the device's usage count and
 * return immediately if it is larger than zero. Then carry out an idle
 * notification, either synchronous or asynchronous.
 *
 * This routine may be called in atomic context if the RPM_ASYNC flag is set,
 * or if pm_runtime_irq_safe() has been called.
 */
Here are the source and Documentation.
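For a little more context: pm_runtime_put_sync() is the counterpart of pm_runtime_get_sync(), and drivers typically bracket hardware access with the pair. A minimal sketch, with the function name and the I/O placeholder being hypothetical:

#include <linux/pm_runtime.h>
#include <linux/device.h>

/* Hypothetical transfer path in a driver. */
static int my_driver_do_transfer(struct device *dev)
{
    int ret;

    ret = pm_runtime_get_sync(dev);    /* usage count++, resume the device if needed */
    if (ret < 0) {
        pm_runtime_put_noidle(dev);    /* undo the count bump on failure */
        return ret;
    }

    /* ... touch the hardware here ... */

    pm_runtime_put_sync(dev);          /* usage count--, idle the device if it reaches zero */
    return 0;
}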

Understanding Linux Kernel Circular Buffer

There is an article at http://lwn.net/Articles/378262/ that describes the Linux kernel's circular buffer implementation. I have some questions.
Here is the "producer":
spin_lock(&producer_lock);

unsigned long head = buffer->head;
unsigned long tail = ACCESS_ONCE(buffer->tail);

if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
    /* insert one item into the buffer */
    struct item *item = buffer[head];

    produce_item(item);

    smp_wmb(); /* commit the item before incrementing the head */

    buffer->head = (head + 1) & (buffer->size - 1);

    /* wake_up() will make sure that the head is committed before
     * waking anyone up */
    wake_up(consumer);
}

spin_unlock(&producer_lock);
Questions:
1. Since this code explicitly deals with memory ordering and atomicity, what is the point of the spin_lock()?
2. So far, my understanding is that ACCESS_ONCE stops compiler reordering, true?
3. Does produce_item(item) simply issue all of the writes associated with the item?
4. I believe smp_wmb() guarantees that all of the writes in produce_item(item) complete BEFORE the "publishing" write that follows it. True?
5. The commentary on the page where I got this code seems to imply that an smp_wmb() would normally be needed after updating the head index, but wake_up(consumer) does this, so it's not necessary. Is that true? If so, why?
Here is the "consumer":
spin_lock(&consumer_lock);
unsigned long head = ACCESS_ONCE(buffer->head);
unsigned long tail = buffer->tail;
if (CIRC_CNT(head, tail, buffer->size) >= 1) {
/* read index before reading contents at that index */
smp_read_barrier_depends();
/* extract one item from the buffer */
struct item *item = buffer[tail];
consume_item(item);
smp_mb(); /* finish reading descriptor before incrementing tail */
buffer->tail = (tail + 1) & (buffer->size - 1);
}
spin_unlock(&consumer_lock);
Questions specific to "consumer":
1. What does smp_read_barrier_depends() do? From some comments in a forum it seems like you could have issued an smp_rmb() here, but on some architectures (e.g. x86) this is unnecessary and too expensive, so smp_read_barrier_depends() was created to issue it only where needed... That said, I don't really understand why smp_rmb() is ever necessary!
2. Is the smp_mb() there to guarantee that all of the reads before it complete before the write after it?
For the producer:
1. The spin_lock() here is to prevent two producers from trying to modify the queue at the same time.
2. ACCESS_ONCE does prevent reordering; it also prevents the compiler from reloading the value later. (There's an article about ACCESS_ONCE on LWN that expands on this further.)
3. Correct.
4. Also correct.
5. The (implied) write barrier here is needed before waking the consumer, as otherwise the consumer might not see the updated head value.
Consumer:
1. smp_read_barrier_depends() is a data dependency barrier, which is a weaker form of a read barrier (see 2). The effect in this case is to ensure that buffer->tail is read before it is used as an array index in buffer[tail].
2. smp_mb() here is a full memory barrier, ensuring all reads and writes are committed by this point.
Additional references:
Linux kernel documentation on memory barriers
(Note: I'm not entirely sure about my answers for 5 in the producer and 1 for the consumer, but I believe they're a fair approximation of the facts. I highly recommend reading the documentation page about memory barriers, as it's more comprehensive than anything I could write here.)
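For reference, the CIRC_CNT()/CIRC_SPACE() helpers used in both snippets come from include/linux/circ_buf.h; paraphrasing that header, they look roughly like this and assume the buffer size is a power of two:

/* Number of items currently in the buffer. */
#define CIRC_CNT(head, tail, size)   (((head) - (tail)) & ((size) - 1))

/* Free space in the buffer; one slot is always left empty so that
 * head == tail unambiguously means "empty" rather than "full". */
#define CIRC_SPACE(head, tail, size) CIRC_CNT((tail), ((head) + 1), (size))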

c++ multi threading - lock one pointer assignment?

I have a method as below
SomeStruct* abc;

void NullABC()
{
    abc = NULL;
}
This is just an example and not very interesting.
Many threads could call this method at the same time.
Do I need to lock the "abc = NULL" line?
I think it is just a pointer, so the write could be done in one shot and there isn't really a need for a lock, but I just wanted to make sure.
Thanks
It depends on the platform on which you are running. On many platforms, as long as abc is correctly aligned, the write will be atomic.
However, if your platform does not have such a guarantee, you need to synchronize access to the variable, using a lock, an atomic variable, or an interlocked operation.
No, you do not need a lock, at least not on x86. A memory barrier is required in many real-world situations though, and locking is one way to get one (the other would be an explicit barrier). You may also consider using an interlocked operation, like Visual C++'s InterlockedExchangePointer, if you need access to the original pointer. There are equivalent intrinsics supported by most compilers.
If no other threads are ever using abc for any other purpose, then the code as shown is fine... but of course it's a bit silly to have a pointer that never gets used except to set it to NULL.
If there is some other code somewhere that does something like this, OTOH:
if (abc != NULL)
{
    abc->DoSomething();
}
Then in this case both the code that uses the abc pointer (above) and the code that changes it (that you posted) need to lock a mutex before accessing abc. Otherwise the code above risks crashing if the value of abc gets set to NULL after the if statement but before the DoSomething() call.
A borderline case would be if the other code does this:
SomeStruct * my_abc = abc;

if (my_abc != NULL)
{
    my_abc->DoSomething();
}
That will probably work, because at the time the abc pointer's value is copied over to my_abc, the value of abc is either NULL or it isn't... and my_abc is a local variable, so other threads won't be able to change it before DoSomething() is called. It could theoretically break on platforms where copying a pointer isn't atomic, though (in which case my_abc might end up being an invalid pointer, with half of abc's bits and half NULL bits)... but common PC hardware will copy pointers atomically, so it shouldn't be an issue there. It might be worthwhile to use a mutex anyway, just for paranoia's sake.
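As a portable alternative to relying on the platform making plain pointer stores atomic, the pointer can be made explicitly atomic. A minimal C11 sketch (std::atomic<SomeStruct*> is the C++ equivalent), with DoSomethingWith() standing in for the abc->DoSomething() call; note that this only makes reading and writing the pointer itself safe and ordered, and does nothing to protect the lifetime of the pointed-to object, which still needs a mutex or similar if another thread may delete it:

#include <stdatomic.h>
#include <stddef.h>

struct SomeStruct;                            /* opaque for this sketch */
void DoSomethingWith(struct SomeStruct *p);   /* hypothetical helper */

static _Atomic(struct SomeStruct *) abc;      /* atomic pointer: no torn reads/writes */

void NullABC(void)
{
    atomic_store_explicit(&abc, NULL, memory_order_release);
}

void UseABC(void)
{
    /* take a local copy once, as in the "borderline" example above */
    struct SomeStruct *my_abc = atomic_load_explicit(&abc, memory_order_acquire);

    if (my_abc != NULL)
        DoSomethingWith(my_abc);
}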

ucontext across threads

Are contexts (the objects manipulated by functions in ucontext.h) allowed to be shared across threads? That is, can I swapcontext with the second argument being a context created in makecontext on another thread? A test program seems to show this working on Linux. I can't find documentation one way or the other on this, whereas Windows fibers appear to explicitly support such a use case. Is this safe and OK to do in general? Is it standard POSIX behavior that this should work?
Actually, there was NGPT, a threading library for Linux that used not the current 1:1 threading model (where each user thread is a kernel thread, or LWP), but an M:N threading model (where several user threads correspond to a smaller number of kernel threads).
According to ftp://ftp.uni-duisburg.de/Linux/NGPT/ngpt-0.9.4.tar.gz/ngpt-0.9.4/pth_sched.c:170 (pth_scheduler), it was possible to move user thread contexts between native (kernel) threads:
/*
 * See if the thread is unbound...
 * Break out and schedule if so...
 */
if (current->boundnative == 0)
    break;

/*
 * See if the thread is bound to a different native thread...
 * Break out and schedule if not...
 */
if (current->boundnative == this_sched->lastrannative)
    break;
To save and restore user threads, ucontext can be used (ftp://ftp.uni-duisburg.de/Linux/NGPT/ngpt-0.9.4.tar.gz/ngpt-0.9.4/pth_mctx.c:64), and it seems this was the preferred method (mcsc):
/*
* save the current machine context
*/
#if PTH_MCTX_MTH(mcsc)
#define pth_mctx_save(mctx) \
( (mctx)->error = errno, \
getcontext(&(mctx)->uc) )
#elif
....
/*
* restore the current machine context
* (at the location of the old context)
*/
#if PTH_MCTX_MTH(mcsc)
#define pth_mctx_restore(mctx) \
( errno = (mctx)->error, \
(void)setcontext(&(mctx)->uc) )
#elif PTH_MCTX_MTH(sjlj)
...
#if PTH_MCTX_MTH(mcsc)
/*
* VARIANT 1: THE STANDARDIZED SVR4/SUSv2 APPROACH
*
* This is the preferred variant, because it uses the standardized
* SVR4/SUSv2 makecontext(2) and friends which is a facility intended
* for user-space context switching. The thread creation therefore is
* straight-foreward.
*/
So, even though NGPT is dead and unused, it chose the *context() functions for switching user threads, even between kernel threads. I assume that using the *context() family is safe enough on Linux.
There can be some problems when mixing ucontexts with another native threading library. I will consider NPTL, which has been the standard Linux native threading library since glibc 2.4. The main problem is THREAD_SELF, the pointer to the struct pthread of the current thread. TLS (thread-local storage) also works via THREAD_SELF. THREAD_SELF is usually stored in a register (r2 on PowerPC, %gs on x86, etc.). get/setcontext might save and restore this register, breaking internals of the native pthread library (e.g. thread-local storage, thread identification, etc.).
The glibc setcontext does not save/restore the %gs register, in order to stay compatible with pthreads:
/* Restore the FS segment register. We don't touch the GS register
since it is used for threads. */
movl oFS(%eax), %ecx
movw %cx, %fs
You should check whether setcontext saves the THREAD_SELF register on the architecture you are interested in. Also, your code may not be portable between OSes and libcs.
From the man page:
In a System V-like environment, one has the type ucontext_t defined in <ucontext.h> and the four functions getcontext(2), setcontext(2), makecontext() and swapcontext() that allow user-level context switching between multiple threads of control within a process.
Sounds like that's what it's for.
EDIT: although this discussion seems to indicate that you shouldn't be mixing them.
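For readers who have not used the API, a minimal single-threaded getcontext()/makecontext()/swapcontext() example looks like this (it deliberately does not exercise the cross-thread case the question is about):

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, func_ctx;
static char func_stack[64 * 1024];      /* stack for the secondary context */

static void func(void)
{
    puts("in func");
    /* returning from func resumes uc_link, i.e. main_ctx */
}

int main(void)
{
    getcontext(&func_ctx);                      /* initialize the context */
    func_ctx.uc_stack.ss_sp = func_stack;       /* give it its own stack */
    func_ctx.uc_stack.ss_size = sizeof(func_stack);
    func_ctx.uc_link = &main_ctx;               /* where to continue when func returns */
    makecontext(&func_ctx, func, 0);

    puts("switching to func");
    swapcontext(&main_ctx, &func_ctx);          /* save main_ctx, run func_ctx */
    puts("back in main");
    return 0;
}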
