pthread condition variables on Linux, odd behaviour

I'm synchronizing reader and writer processes on Linux.
I have zero or more processes (the readers) that need to sleep until they are woken up, read a resource, go back to sleep, and so on. Please note I don't know how many reader processes are up at any moment.
I have one process (the writer) that writes on a resource, wakes up the readers and does its business until another resource is ready (in detail, I developed a no-starve readers-writers solution, but that's not important).
To implement the sleep / wake up mechanism I use a POSIX condition variable, pthread_cond_t. The clients call pthread_cond_wait() on the variable to sleep, while the server does a pthread_cond_broadcast() to wake them all up. As the manual says, I surround these two calls with a lock/unlock of the associated pthread mutex.
The condition variable and the mutex are initialized in the server and shared between processes through a shared memory area (because I'm not working with threads, but with separate processes) and I'm sure my kernel and C library support it (because I checked _POSIX_THREAD_PROCESS_SHARED).
What happens is that the first client process sleeps and wakes up perfectly. When I start the second process, it blocks on its pthread_cond_wait() and never wakes up, even if I'm sure (by the logs) that pthread_cond_broadcast() is called.
If I kill the first process and launch another one, it works perfectly. In other words, pthread_cond_broadcast() seems to wake up only one process at a time. If more than one process waits on the very same shared condition variable, only the first one manages to wake up correctly, while the others just seem to ignore the broadcast.
Why this behaviour? If I send a pthread_cond_broadcast(), every waiting process should wake up, not just one (and certainly not always the same one).

Have you set the PTHREAD_PROCESS_SHARED attribute on both your condvar and mutex?
For Linux consult the following man pages:
pthread_mutexattr_init (with sample)
pthread_mutexattr_setpshared
pthread_condattr_init
pthread_condattr_setpshared
Functions, types, constants, etc. are normally declared in /usr/include/pthread.h and /usr/include/nptl/pthread.h.
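For reference, here is a minimal sketch of how that initialisation might look, assuming the mutex and condvar live in a POSIX shared-memory segment created by the server (the name "/my_shm" and the struct layout are only illustrative, and error checking is omitted):

/* Sketch: process-shared mutex + condvar in shared memory.
 * Build with: cc demo.c -pthread -lrt */
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct shared {
    pthread_mutex_t mtx;
    pthread_cond_t  cond;
    int             data_ready;
};

int main(void)
{
    int fd = shm_open("/my_shm", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(struct shared));
    struct shared *sh = mmap(NULL, sizeof(*sh), PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);

    pthread_mutexattr_t ma;
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&sh->mtx, &ma);

    pthread_condattr_t ca;
    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&sh->cond, &ca);

    return 0;
}

Without PTHREAD_PROCESS_SHARED on both objects, using them from more than one process is undefined behaviour, so symptoms like the ones you describe would not be surprising.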

Do you test for some condition before calling pthread_cond_wait()? I am asking because it's a very common mistake: your process must not call wait() unless you know some other process will call signal() (or broadcast()) later.
Consider this code (from the pthread_cond_wait man page):
pthread_mutex_lock(&mut);
while (x <= y) {
    pthread_cond_wait(&cond, &mut);
}
/* operate on x and y */
pthread_mutex_unlock(&mut);
If you omit the while test and just signal from another process whenever the condition the waiter cares about becomes true, it won't work, since the signal only wakes up processes that are already waiting. If signal() was called before the other process calls wait(), the signal is lost and the waiting process will wait forever.
EDIT: About the while loop.
When you signal one process from another, the woken process is put on the "ready list" but not necessarily scheduled right away, and your condition (x <= y) may change again in the meantime since nobody holds the lock. That's why you need to re-check the condition each time you wake up, before you do the work or go back to waiting. It should always be: wake up -> check whether the condition still holds -> do work.
Hope it's clear.
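For completeness, here is a sketch of the writer-side counterpart to the man-page fragment above (same mut, cond, x and y as in that snippet): change the shared state and broadcast while holding the same mutex, so the waiters' condition check cannot race with the wakeup.

pthread_mutex_lock(&mut);
x = y + 1;                      /* make the waited-for condition (x > y) true */
pthread_cond_broadcast(&cond);  /* wake every waiter; each one re-checks (x <= y) */
pthread_mutex_unlock(&mut);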

The documentation says that it should work... are you sure it's the same condition variable that the rest of the processes are looking at?
This is the example code from opengroup.org:
pthread_cond_wait(mutex, cond):
    value = cond->value;                  /* 1 */
    pthread_mutex_unlock(mutex);          /* 2 */
    pthread_mutex_lock(cond->mutex);      /* 10 */
    if (value == cond->value) {           /* 11 */
        me->next_cond = cond->waiter;
        cond->waiter = me;
        pthread_mutex_unlock(cond->mutex);
        unable_to_run(me);
    } else
        pthread_mutex_unlock(cond->mutex);   /* 12 */
    pthread_mutex_lock(mutex);               /* 13 */

pthread_cond_signal(cond):
    pthread_mutex_lock(cond->mutex);      /* 3 */
    cond->value++;                        /* 4 */
    if (cond->waiter) {                   /* 5 */
        sleeper = cond->waiter;           /* 6 */
        cond->waiter = sleeper->next_cond;   /* 7 */
        able_to_run(sleeper);             /* 8 */
    }
    pthread_mutex_unlock(cond->mutex);    /* 9 */

What the last poster said is correct. The KEY to the whole cond-variable situation working correctly is that the cond-var is NOT signalled prior to it being waited on. It's strictly a signal that is to be used when others (one or more) are waiting. When no one is waiting, it's effectively a NOP. Which, btw, is NOT how I believe it SHOULD work, but it is how it DOES work.
larry

Related

How does SIGSTOP work in Linux kernel?

I am wondering how SIGSTOP works inside the Linux kernel. How is it handled? And how does the kernel stop running when it is handled?
I am familiar with the kernel code base. So, if you can reference kernel functions that will be fine, and in fact that is what I want. I am not looking for high level description from a user's perspective.
I have already instrumented get_signal_to_deliver() with printk() statements (it is compiling right now). But I would like someone to explain things in better detail.
It's been a while since I touched the kernel, but I'll try to give as much detail as possible. I had to look up some of this stuff in various other places, so some details might be a little messy, but I think this gives a good idea of what happens under the hood.
When a signal is raised, the TIF_SIGPENDING flag is set in the process descriptor structure. Before returning to user mode, the kernel tests this flag with test_thread_flag(TIF_SIGPENDING), which will return true (because a signal is pending).
The exact details of where this happens seem to be architecture-dependent, but you can see an example for um (User-Mode Linux):
void interrupt_end(void)
{
    struct pt_regs *regs = &current->thread.regs;

    if (need_resched())
        schedule();
    if (test_thread_flag(TIF_SIGPENDING))
        do_signal(regs);
    if (test_and_clear_thread_flag(TIF_NOTIFY_RESUME))
        tracehook_notify_resume(regs);
}
Anyway, it ends up calling arch_do_signal(), which is also architecture dependent and is defined in the corresponding signal.c file (see the example for x86):
void arch_do_signal(struct pt_regs *regs)
{
    struct ksignal ksig;

    if (get_signal(&ksig)) {
        /* Whee! Actually deliver the signal. */
        handle_signal(&ksig, regs);
        return;
    }

    /* Did we come from a system call? */
    if (syscall_get_nr(current, regs) >= 0) {
        /* Restart the system call - no handlers present */
        switch (syscall_get_error(current, regs)) {
        case -ERESTARTNOHAND:
        case -ERESTARTSYS:
        case -ERESTARTNOINTR:
            regs->ax = regs->orig_ax;
            regs->ip -= 2;
            break;
        case -ERESTART_RESTARTBLOCK:
            regs->ax = get_nr_restart_syscall(regs);
            regs->ip -= 2;
            break;
        }
    }

    /*
     * If there's no signal to deliver, we just put the saved sigmask
     * back.
     */
    restore_saved_sigmask();
}
As you can see, arch_do_signal() calls get_signal(), which is also in signal.c.
The bulk of the work happens inside get_signal(). It's a huge function, but eventually it processes the special case of SIGSTOP here:
if (sig_kernel_stop(signr)) {
    /*
     * The default action is to stop all threads in
     * the thread group. The job control signals
     * do nothing in an orphaned pgrp, but SIGSTOP
     * always works. Note that siglock needs to be
     * dropped during the call to is_orphaned_pgrp()
     * because of lock ordering with tasklist_lock.
     * This allows an intervening SIGCONT to be posted.
     * We need to check for that and bail out if necessary.
     */
    if (signr != SIGSTOP) {
        spin_unlock_irq(&sighand->siglock);

        /* signals can be posted during this window */

        if (is_current_pgrp_orphaned())
            goto relock;

        spin_lock_irq(&sighand->siglock);
    }

    if (likely(do_signal_stop(ksig->info.si_signo))) {
        /* It released the siglock. */
        goto relock;
    }

    /*
     * We didn't actually stop, due to a race
     * with SIGCONT or something like that.
     */
    continue;
}
See the full function here.
do_signal_stop() does the necessary processing to handle SIGSTOP; you can also find it in signal.c. It sets the task state to TASK_STOPPED with set_special_state(TASK_STOPPED), a macro defined in include/linux/sched.h that updates the current process descriptor's state (see the relevant line in signal.c). Further down, it calls freezable_schedule(), which in turn calls schedule(). schedule() calls __schedule() (also in the same file) in a loop until an eligible task is found. __schedule() attempts to find the next task to schedule (next in the code), and the current task is prev. The state of prev is checked, and because it was changed to TASK_STOPPED, deactivate_task() is called, which moves the task from the run queue to the sleep queue:
} else {
    ...
    deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
    ...
}
deactivate_task() (also in the same file) removes the process from the run queue by setting the on_rq field of the task_struct to 0 and calling dequeue_task(), which moves the process to the new (waiting) queue.
Then, schedule() checks the number of runnable processes and selects the next task to enter the CPU according to the scheduling policies in effect (I think this is a little bit out of scope by now).
At the end of the day, SIGSTOP moves a process from the runnable queue to a waiting queue until that process receives SIGCONT.
Nearly every time there is an interrupt, the kernel suspends some process from running and switches to running the interrupt handler (the only exception being when there is no process running). Likewise, the kernel will suspend processes that run too long without giving up the CPU (and technically that's the same thing: it just originates from the timer interrupt or possibly an IPI). Ordinarily in these cases, the kernel then puts the suspended process back on the run queue and when the scheduling algorithm decides the time is right, it is resumed.
In the case of SIGSTOP, the same basic thing happens: the affected processes are suspended due to the reception of the stop signal. They just don't get put back on the run queue until SIGCONT is sent. Nothing extraordinary here: SIGSTOP is just instructing the kernel to make a process non-runnable until further notice.
[One note: you seemed to imply that the kernel stops running with SIGSTOP. That is of course not the case. Only the SIGSTOPped processes stop running.]
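If you want to see this behaviour from user space rather than from the kernel sources, here is a small sketch (my own illustration, not kernel code): a parent stops a child with SIGSTOP, observes it being reported as stopped, then continues and terminates it.

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {                /* child: just sleep until signalled */
        for (;;)
            pause();
    }

    sleep(1);
    kill(pid, SIGSTOP);            /* child becomes non-runnable (state 'T' in ps) */

    int status;
    waitpid(pid, &status, WUNTRACED);
    if (WIFSTOPPED(status))
        printf("child %d stopped by signal %d\n", (int)pid, WSTOPSIG(status));

    kill(pid, SIGCONT);            /* put it back on the run queue */
    kill(pid, SIGTERM);            /* then let it terminate normally */
    waitpid(pid, &status, 0);
    return 0;
}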

Linux kernel - wait queues

I'm reading "Linux Kernel Development", 3rd edition, by Robert Love to get a general idea about how the Linux kernel works. (2.6.2.3)
I'm confused about how wait queues work for example this code:
/* 'q' is the wait queue we wish to sleep on */
DEFINE_WAIT(wait);
add_wait_queue(q, &wait);
while (!condition) {   /* condition is the event that we are waiting for */
    prepare_to_wait(&q, &wait, TASK_INTERRUPTIBLE);
    if (signal_pending(current))
        /* handle signal */
    schedule();
}
finish_wait(&q, &wait);
I want to know which process is running this code. Is it a kernel thread? Whose process time is this?
Also, in the loop, while the condition is still not met we keep sleeping and call schedule() to run another process; the question is, when do we return to this loop?
The book says that when a process sleeps, it is removed from the run queue, since otherwise it would be woken up and would end up in a busy loop...
Also says: "sleeping should always be handled in a loop that ensures that the condition for which the task is waiting has indeed occurred."
I just want to know in what context is this loop running?
Sorry if this is a stupid question. I'm just having trouble seeing the big picture.
Which process is running the code? The process that called it. I don't mean to make fun of the question, but the gist is that kernel code can run in different contexts: either because a system call led to this place, because it is in an interrupt handler, or because it is a callback function called from another context (such as workqueues or timer functions).
Since this example is sleeping, it must be in a context where sleeping is allowed, meaning it is executed in response to a system call or at least in a kernel thread. So the answer is the process time is taken from the process (or kernel thread) that called into this kernel code that needs to sleep. That is the only place where sleeping is allowed in the first place.
A certain special case are workqueues, these are explicitly for functions that need to sleep. Typical use would be to queue a function that needs to sleep from a context where sleeping is forbidden. In that case, the process context is that of one of the kernel worker threads designated to process workqueue items.
You will return to this loop when the wait_queue is woken up, which either sets one task waiting on the queue to runnable or all of them, depending on the wake_up function called.
The most important thing is: forget about this unless you are interested in the implementation details. Since many people got this wrong, and it's basically the same thing everywhere it's needed, there have long been macros encapsulating the whole procedure. Look up wait_event(); this is what your example should really look like:
wait_event(q, condition);
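A slightly fuller sketch of that pattern, assuming a kernel-module context (the names q and condition are just placeholders, and details such as locking and memory barriers are omitted):

#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(q);
static int condition;              /* the event we are waiting for */

/* Runs in process context (e.g. inside a read() handler): sleeps until
 * the condition becomes true, or returns -ERESTARTSYS on a signal. */
static int waiter(void)
{
    return wait_event_interruptible(q, condition != 0);
}

/* Runs wherever the event is produced: set the condition, then wake
 * every task sleeping on q so they can re-check it. */
static void producer(void)
{
    condition = 1;
    wake_up(&q);
}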
As per your example... I added comments.
NOTE: the task is not actually put to sleep until prepare_to_wait() changes its state and schedule() is called.

DEFINE_WAIT(wait);        /* create a wait queue entry for the current task */
add_wait_queue(q, &wait); /* add that entry to the wait queue head 'q'
                           * (much like appending to a linked list) */
while (!condition) {
    /* 'condition' is the event we are waiting for; for example, data
     * arriving from user space in a write method (via __get_user()) */
    prepare_to_wait(&q, &wait, TASK_INTERRUPTIBLE);
    /* mark the task interruptible; it will sleep until some wake_up()
     * call on 'q' (or a signal) makes it runnable again */
    if (signal_pending(current))
        /* a signal such as SIGINT or SIGKILL is pending for the current
         * task: normally you would return -ERESTARTSYS or break out of
         * the loop here instead of going back to sleep */
    schedule();           /* give up the CPU until we are woken up */
}
finish_wait(&q, &wait);   /* set the task back to TASK_RUNNING and remove
                           * it from the wait queue */

When to call sem_unlink()?

I'm a little confused by the Linux API sem_unlink(), mainly about when or why to call it. I've used semaphores on Windows for many years. In Windows, once you close the last handle of a named semaphore the system removes the underlying kernel object. But it appears that on Linux you, the developer, need to remove the kernel object by calling sem_unlink(). If you don't, the kernel object persists in the /dev/shm folder.
The problem I'm running into: if process A calls sem_unlink() while process B has the semaphore locked, it immediately destroys the semaphore, and now process B is no longer "protected" by the semaphore when/if process C comes along. What's more, the man page is confusing at best:
"The semaphore name is removed immediately. The semaphore is destroyed once all other processes that have the semaphore open close it."
How can it destroy the object immediately if it has to wait for other processes to close the semaphore?
Clearly I don't understand the proper use of semaphore objects on Linux. Thanks for any help. Below is some sample code I'm using to test this.
int main(void)
{
    sem_t *pSemaphore = sem_open("/MyName", O_CREAT, S_IRUSR | S_IWUSR, 1);

    if (pSemaphore != SEM_FAILED)
    {
        if (sem_wait(pSemaphore) == 0)
        {
            // Perform "protected" operations here
            sem_post(pSemaphore);
        }
        sem_close(pSemaphore);
        sem_unlink("/MyName");
    }
    return 0;
}
Response to your questions:
In comparison to the Windows semaphore behaviour you describe, POSIX semaphores are kernel-persistent: the semaphore retains its value even if no process has it open (i.e. the semaphore's reference count is 0).
If process A calls sem_unlink() while process B has the semaphore locked (so the reference count is not 0), the semaphore will not be destroyed.
Basic operation of sem_close() vs sem_unlink(), which I think will help the overall understanding:
sem_close: closes a semaphore; this is also done when a process exits. The semaphore still remains in the system.
sem_unlink: removes the name; the semaphore itself is destroyed only when the reference count reaches 0 (that is, after all processes that have it open call sem_close() or exit).
References:
Book: UNIX Network Programming, Volume 2: Interprocess Communication, by W. Richard Stevens, ch. 10
The sem_unlink() function removes the semaphore identified by name and marks
the semaphore to be destroyed once all processes cease using it (this may mean
immediately, if all processes that had the semaphore open have already closed it).
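To make the man-page wording concrete, here is a small sketch (based on the sample code above, with the same "/MyName" name): sem_unlink() only removes the name, while a semaphore that is still open keeps working through its existing handle.

#include <fcntl.h>
#include <semaphore.h>

int main(void)
{
    sem_t *sem = sem_open("/MyName", O_CREAT, 0600, 1);
    if (sem == SEM_FAILED)
        return 1;

    sem_unlink("/MyName");   /* the name disappears: a later sem_open(O_CREAT)
                              * would create a brand-new semaphore */

    sem_wait(sem);           /* ...but this handle is still perfectly valid */
    /* protected operations here */
    sem_post(sem);

    sem_close(sem);          /* last close: now the kernel object is destroyed */
    return 0;
}

The coordination question (who is allowed to unlink, and when) is still yours to solve; a common convention is that the creating process unlinks only once it knows every other user has already called sem_open(), or during an explicit shutdown step.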

How does Wait/Signal (semaphore) implementation pseudo-code "work"?

Wait(semaphore sem) {
    DISABLE_INTS
    sem.val--
    if (sem.val < 0) {
        add thread to sem.L
        block(thread)
    }
    ENABLE_INTS
}

Signal(semaphore sem) {
    DISABLE_INTS
    sem.val++
    if (sem.val <= 0) {
        th = remove next thread from sem.L
        wakeup(th)
    }
    ENABLE_INTS
}
If block(thread) stops a thread from executing, how, where, and when does it return?
Which thread enables interrupts following the Wait()?
the thread that called block() shouldn’t return until another thread has called wakeup(thread)!
but how does that other thread get to run?
where exactly does the thread switch occur?
block(thread) works this way:
Enables interrupts.
Uses some kind of waiting mechanism (provided by the operating system, or busy waiting in the simplest case) to wait until wakeup(thread) is called on this thread. At this point the thread yields its time to the scheduler.
Disables interrupts and returns.
Yes, UP and DOWN are mostly useful when called from different threads, but it is not impossible to call them from one thread: if you initialise the semaphore with a value > 0, then the same thread can enter the critical section and execute both DOWN (before) and UP (after). The value that initialises the semaphore tells how many threads can be inside the critical section at once, which might be 1 (a mutex) or any other positive number.
How are the threads created? That is not shown on the lecture slide, because the slide only shows, in pseudocode, the principle of how a semaphore works. How you use those semaphores in your application is a completely different story.
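As a concrete illustration of that last point, here is a small sketch using POSIX semaphores (not the lecture's pseudocode): a semaphore initialised to 1 acts as a mutex, and the same thread performs both the DOWN (sem_wait) and the UP (sem_post).

#include <semaphore.h>

int main(void)
{
    sem_t sem;
    sem_init(&sem, 0, 1);   /* initial value 1: one thread may be inside at a time */

    sem_wait(&sem);         /* DOWN: value 1 -> 0, does not block */
    /* critical section */
    sem_post(&sem);         /* UP: value 0 -> 1 */

    sem_destroy(&sem);
    return 0;
}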

flock locking order?

I'm using a simple test script from
http://www.tuxradar.com/practicalphp/8/11/0
like this:
<?php
$fp = fopen("foo.txt", "w");
if (flock($fp, LOCK_EX)) {
    print "Got lock!\n";
    sleep(10);
    flock($fp, LOCK_UN);
}
I opened 5 shells and executed the script in them, one after the other.
The scripts block until the lock is freed and then continue once it is released.
I'm not really interested in the PHP side of this, but my question is:
does anyone know the order in which the flock() locks are acquired?
e.g.
t0: process 1 lock's
t1: process 2 try_lock < blocking
t2: process 3 try_lock < blocking
t3: process 1 releases lock
t4: ?? which process gets the lock?
Is there a simple deterministic order, like a queue, or does the kernel 'just' pick one according to "more advanced rules"?
If there are multiple processes waiting for an exclusive lock, it's not specified which one succeeds in acquiring it first. Don't rely on any particular ordering.
Having said that, the current kernel code wakes them in the order they blocked. This comment is in fs/locks.c:
/* Insert waiter into blocker's block list.
* We use a circular list so that processes can be easily woken up in
* the order they blocked. The documentation doesn't require this but
* it seems like the reasonable thing to do.
*/
If you want to have a set of processes run in order, don't use flock(). Use SysV semaphores (semget() / semop()).
Create a semaphore set that contains one semaphore for each process after the first, and initialise them all to 1. For every process after the first, do a semop() on that process's semaphore with a sem_op value of zero - this waits for the value to become zero, so it blocks. After the first process is complete, it should do a semop() on the second process's semaphore with a sem_op value of -1 - this drops that semaphore to zero and wakes the second process. After the second process is complete, it should do a semop() on the third process's semaphore with a sem_op value of -1, and so on.
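A rough sketch of that hand-off in C (my own illustration of the scheme above; NPROCS, the key file and the process layout are assumptions, and error checking is omitted):

#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* Linux requires the caller to define union semun. */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

/* Block until semaphore 'idx' in the set reaches zero. */
static void wait_for_turn(int semid, int idx)
{
    struct sembuf op = { .sem_num = (unsigned short)idx, .sem_op = 0, .sem_flg = 0 };
    semop(semid, &op, 1);
}

/* Drop semaphore 'idx' from 1 to 0, waking the process waiting on it. */
static void hand_over(int semid, int idx)
{
    struct sembuf op = { .sem_num = (unsigned short)idx, .sem_op = -1, .sem_flg = 0 };
    semop(semid, &op, 1);
}

int main(void)
{
    enum { NPROCS = 3 };                 /* total number of processes (assumption) */
    key_t key = ftok("foo.txt", 'S');    /* any agreed-upon existing file works */
    int semid = semget(key, NPROCS - 1, IPC_CREAT | 0600);

    /* Only the first process should initialise the set: one semaphore per
     * process after the first, each non-zero so the waiters block. */
    for (int i = 0; i < NPROCS - 1; i++)
        semctl(semid, i, SETVAL, (union semun){ .val = 1 });

    /* Process 0: do its work, then hand_over(semid, 0);
     * Process 1: wait_for_turn(semid, 0); do its work; hand_over(semid, 1);
     * Process 2: wait_for_turn(semid, 1); do its work; and so on. */
    return 0;
}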
