Best equivalent for EnterCriticalSection on Mac OS X?

What's the best equivalent? I didn't find any reasonable solution for such a simple function. Choices I'm aware of:
1) MPEnterCriticalRegion - unfortunately extremely inefficient, probably because despite its name it enters kernel mode, so for repeated locking it just takes way too much time...
2) OSSpinLockLock - unusable, because apparently not recursive. If it were recursive, it would be the correct equivalent.
3) pthread_mutex_lock - didn't try it, but I don't expect much, because it will probably just be emulated using the critical region or another system resource.

Assuming you have a correctly working non-recursive lock, it's rather easy to get an efficient recursive lock (I don't know the Mac APIs, so this uses C++11's std::mutex as a stand-in for whatever lock you like):
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

class RecursiveLock {
public:
    void acquire() {
        auto tid = std::this_thread::get_id();
        if (owner.load(std::memory_order_acquire) == tid) {
            // We already hold the lock; just bump the count.
            lockCnt++;
        } else {
            lock.lock();
            owner.store(tid, std::memory_order_release);
            lockCnt = 1;
        }
    }
    void release() {
        assert(owner.load(std::memory_order_relaxed) == std::this_thread::get_id());
        lockCnt--;
        if (lockCnt == 0) {
            // std::thread::id() compares unequal to every real thread's id.
            owner.store(std::thread::id(), std::memory_order_release);
            lock.unlock();
        }
    }
private:
    int lockCnt = 0;
    std::atomic<std::thread::id> owner; // needs ordering guarantees, hence atomic
    std::mutex lock;                    // use whatever non-recursive lock you like here
};
The reasoning is simple:
If tid == owner, it's guaranteed that we have already acquired the lock ourselves.
If tid != owner, either someone else holds the lock or it's free; in both cases we try to acquire the lock and block until we get it. After we acquire the lock, we set owner to tid. So there's a window where we have acquired the lock but owner still holds the illegal value. That's no problem, because the illegal id won't compare equal to any real thread's id either, so those threads will take the else branch as well and have to wait for the lock.
Notice the std::atomic part - we do need ordering guarantees for the owner field to make this legal. If you don't have C++11, use a compiler intrinsic for that.
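In practice, if the underlying lock is a pthread mutex (which Mac OS X provides), you can also ask pthreads for a recursive mutex directly through a standard attribute, so you may not need to build one yourself; a minimal sketch of the initialization:

#include <pthread.h>

static pthread_mutex_t g_lock;

// Initialize g_lock as a recursive mutex - the closest POSIX
// analogue of a Windows CRITICAL_SECTION (which is recursive).
static void init_recursive_lock(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&g_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}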

Where can PTHRED_MUTEX_ADAPTIVE_NP be specified and how does it work?

I found that there's a macro called PTHRED_MUTEX_ADAPTIVE_NP which is somehow given as a value to a mutex so that the mutex does adaptive spinning, meaning that it spins for roughly as long as an immediate wakeup through the kernel would take. But how do I apply this configuration macro to a thread?
And I've developed an improved shared readers-writer lock (it needs only one atomic operation at best, in contrast to the three operations in the Wikipedia solution) with relative writer priority (further readers are stalled when there's a writer, while the readers that arrived before it are allowed to proceed), which could also make use of adaptive spinning: how is the number of spin cycles calculated?
I found that there's a macro called PTHRED_MUTEX_ADAPTIVE_NP
Some pthreads implementations provide a macro PTHREAD_MUTEX_ADAPTIVE_NP (note spelling) that is one of the possible values of the kind_np mutex attribute, but neither that attribute nor the macro are standard. It looks like at least BSD and AIX have them, or at least did at one time, but this is not something you should be using in new code.
But how do I apply this configuration macro to a thread?
You don't. Even if you are using a pthreads implementation that supports it, this is the value of a mutex attribute, not a thread attribute. You obtain a mutex with that attribute value by explicitly requesting it when you initialize the mutex. It would look something like this:
pthread_mutexattr_t attr;
pthread_mutex_t mutex;
int rval;
// Return-value checks omitted for brevity and clarity
rval = pthread_mutexattr_init(&attr);
rval = pthread_mutexattr_setkind_np(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
rval = pthread_mutex_init(&mutex, &attr);
There are other mutex attributes that you can set in analogous ways, which is one of the reasons I wrote this answer. Although you should not be using the kind_np attribute, you can follow this general model for other mutex attributes. There are also thread attributes, which work similarly.
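For instance, a thread attribute such as the stack size follows the same init/set/use/destroy pattern (a minimal sketch; the worker function and the 4 MiB figure are arbitrary illustrations, not recommendations):

#include <pthread.h>
#include <stddef.h>

static void *worker(void *arg)
{
    return NULL;  // placeholder thread body
}

int spawn_with_custom_stack(pthread_t *tid)
{
    pthread_attr_t attr;
    int rval;
    // Return-value checks again omitted for brevity
    rval = pthread_attr_init(&attr);
    rval = pthread_attr_setstacksize(&attr, 4 * 1024 * 1024);  // 4 MiB
    rval = pthread_create(tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
    return rval;
}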
I found the code in glibc. This is the "adaptive" mutex locking code of pthread_mutex_lock in glibc 2.31:
else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
                           == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
  {
    if (! __is_smp)
      goto simple;

    if (LLL_MUTEX_TRYLOCK (mutex) != 0)
      {
        int cnt = 0;
        int max_cnt = MIN (max_adaptive_count (),
                           mutex->__data.__spins * 2 + 10);
        do
          {
            if (cnt++ >= max_cnt)
              {
                LLL_MUTEX_LOCK (mutex);
                break;
              }
            atomic_spin_nop ();
          }
        while (LLL_MUTEX_TRYLOCK (mutex) != 0);

        mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
      }
    assert (mutex->__data.__owner == 0);
  }
So the maximum spin count is the stored per-mutex estimate doubled plus 10, capped at a system-configurable maximum (1000 if there's no configuration), and after the lock is taken, one eighth of the difference between the actual spin count and the stored estimate is added to the estimate for next time.
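For example, if the stored estimate mutex->__data.__spins is 30, a thread will spin at most min(max_adaptive_count(), 70) times; if the lock comes free after 22 spins, the new estimate becomes 30 + (22 - 30) / 8 = 29 (integer division).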

Why would we want to make a function recursive when it has a mutex lock?

https://stackoverflow.com/a/5524120/462608
If you want to call functions recursively, which lock the same mutex, then they either
have to use one recursive mutex, or
have to unlock and lock the same non-recursive mutex again and again (beware of concurrent threads!), or
have to somehow annotate which mutexes they already locked (simulating recursive ownership/mutexes).
Can it in any case be a "sensible" design decision to make a function recursive when it already holds a mutex lock?
Well, one possibility is that the resource you're using lends itself naturally to recursive algorithms.
Think of searching a binary tree while using a mutex to lock everyone else out of the tree (especially to prevent them from modifying it).
If you use a recursive mutex, you can simply have one function search() that you pass the root node in to. It then recursively calls itself as per a normal binary tree search but the first thing it does in that function is to lock the mutex (while this looks like Python, that's really just because Python is an ideal basis for pseudo-code):
def search (haystack, mutex, needle):
    lock mutex recursively
    if haystack == NULL:
        unlock mutex and return NULL
    if haystack.payload == needle:
        unlock mutex and return haystack
    if haystack.payload > needle:
        found = search (haystack.left, mutex, needle)
    else:
        found = search (haystack.right, mutex, needle)
    unlock mutex and return found
The alternative is to separate the mutex lock and search into two separate functions like search() (public) and search_while_locked() (most likely private):
def private search_while_locked (haystack, needle):
    if haystack == NULL:
        return NULL
    if haystack.payload == needle:
        return haystack
    if haystack.payload > needle:
        return search_while_locked (haystack.left, needle)
    return search_while_locked (haystack.right, needle)

def search (haystack, mutex, needle):
    lock mutex non-recursively
    found = search_while_locked (haystack, needle)
    unlock mutex and return found
While that sort of defeats the elegance of the recursive solution, I actually prefer it since it reduces the amount of work that needs to be done (however small that work is, it's still work).
And languages that lend themselves easily to public/private functions can encapsulate the details well. The user of your class has no knowledge (or need of knowledge) as to how you do things within your class, they just call the public API.
Your own functions, however, have access to all the non-public stuff as well as full knowledge as to what locks need to be in place for certain operations.
Another possibility is very much related to that but without being recursive.
Think of any useful operation you may want users to perform on your data which requires that no-one else be using it during that time. So far, you have just the classic case for a non-recursive mutex. For example, clearing all of the entries out of a queue:
def clearQueue():
    lock mutex
    while myQueue.first <> null:
        myQueue.pop()
    unlock mutex
Now let's say you find that rather useful and want to call it from your destructor, which already locks the mutex:
def destructor():
    lock mutex
    clearQueue()
    doSomethingElseNeedingLock()
    unlock mutex
Obviously, with a non-recursive mutex, that's going to lock up on the first line of clearQueue after your destructor calls it, which may be one reason why you'd want a recursive mutex.
You could use the afore-mentioned method of providing a locking public function and a non-locking private one:
def clearQueueLocked():
    while myQueue.first <> null:
        myQueue.pop()

def clearQueue():
    lock mutex
    clearQueueLocked()
    unlock mutex

def destructor():
    lock mutex
    clearQueueLocked()
    doSomethingElseNeedingLock()
    unlock mutex and return
However, if there are a substantial number of these public/private function pairs, it may get a little messy.
In addition to paxdiablo's example using an actual recursive function, don't forget that using a mutex recursively doesn't necessarily mean the functions involved are recursive. I've found recursive mutexes useful for the situation where you have complex operations that need to be atomic with respect to some data structure, where those complex operations rely on more fundamental operations that still need to use the mutex, because the fundamental operations can also be used on their own. An example might be something like the following (note that the code is illustrative only - it doesn't use the proper error handling or transactional techniques that might really be necessary when dealing with accounts and logs):
struct account
{
    mutex mux;
    int balance;
    // other important stuff...
    FILE* transaction_log;
};

void write_timestamp( FILE*);

// "fundamental" operation to write to transaction log
void log_info( struct account* account, char* logmsg)
{
    mutex_acquire( &account->mux);
    write_timestamp( account->transaction_log);
    fputs( logmsg, account->transaction_log);
    mutex_release( &account->mux);
}

// "composed" operation that uses the fundamental operation.
// This relies on the mutex being recursive
void update_balance( struct account* account, int amount)
{
    mutex_acquire( &account->mux);
    int new_balance = account->balance + amount;
    char msg[MAX_MSG_LEN];
    snprintf( msg, sizeof(msg), "update_balance: %d, %d, %d",
              account->balance, amount, new_balance);
    // the following call will acquire the mutex recursively
    log_info( account, msg);
    account->balance = new_balance;
    mutex_release( &account->mux);
}
To do something more or less equivalent without recursive mutexes, the code would need to take care not to reacquire the mutex if it already held it. One option is to add some sort of flag (or thread ID) to the data structure to indicate whether the mutex is already held. In that case you're essentially implementing recursive mutexes yourself - a trickier bit of work than it might seem at first to get right. An alternative is to pass a flag indicating that you already acquired the mutex to functions as a parameter (easier to implement and get right), or simply to have even more fundamental operations that assume the mutex is already acquired and call those from the higher-level functions that take on the responsibility of acquiring the mutex:
// "fundamental" operation to write to transaction log
// this version assumes that the lock is already held
static
void log_info_nolock( struct account* account, char* log msg)
{
write_timestamp( account->transaction_log);
fputs( logmsg, account->transaction_log);
}
// "public" version of the log_info() function that
// acquires the mutex
void log_info( struct account* account, char* logmsg)
{
mutex_acquire( &account->mux);
log_info_nolock( account, logmsg);
mutex_release( &account->mux);
}
// "composed operation that uses the fundamental operation
// since this function acquires the mutex, it much call the
// "nolock" version of the log_info() function
void update_balance( int amount)
{
mutex_acquire( &account->mux);
int new_balance = account->balance + amount;
char msg[MAX_MSG_LEN];
snprintf( msg, sizeof(msg), "update_balance: %d, %d, %d", account->balance, amount, new_balance);
// the following call assumes the lock is already acquired
log_info_nolock( account, msg);
account->balance = new_balance;
mutex_release( &account->mux);
}

Native mutex implementation

So in my illumination days, I started to think about how the hell Windows/Linux implement the mutex. I've implemented this synchronizer in 100 different ways, on many different architectures, but I never thought about how it's really implemented in a big-ass OS; for example, in the ARM world I made some of my synchronizers by disabling interrupts, but I always thought that wasn't a really good way to do it.
I tried to "swim" through the Linux kernel source, but I couldn't find anything that satisfies my curiosity. I'm not an expert in threading, but I have a solid grasp of all the basic and intermediate concepts of it.
So does anyone know how a mutex is implemented?
A quick look at code apparently from one Linux distribution seems to indicate that it is implemented using an interlocked compare-and-exchange. So, in some sense, the OS isn't really implementing it, since the interlocked operation is probably handled at the hardware level.
Edit: As Hans points out, the interlocked exchange does the compare and exchange in an atomic manner. Here is the documentation for the Windows version. For fun, I just now wrote a small test to show a really simple example of creating a mutex like that. This is a simple acquire-and-release test.
#include <windows.h>
#include <assert.h>
#include <stdio.h>

struct homebrew {
    LONG *mutex;
    int *shared;
    int mine;
};

#define NUM_THREADS 10
#define NUM_ACQUIRES 100000

DWORD WINAPI SomeThread( LPVOID lpParam )
{
    struct homebrew *test = (struct homebrew*)lpParam;

    while ( test->mine < NUM_ACQUIRES ) {
        // Test and set the mutex. If it currently has value 0, then it
        // is free. Setting 1 means it is owned. This interlocked function does
        // the test and set as an atomic operation
        if ( 0 == InterlockedCompareExchange( test->mutex, 1, 0 )) {
            // this thread now owns the mutex. Increment the shared variable
            // without an atomic increment (relying on mutex ownership to protect it)
            (*test->shared)++;
            test->mine++;
            // Release the mutex (4 byte aligned assignment is atomic)
            *test->mutex = 0;
        }
    }
    return 0;
}

int main( int argc, char* argv[] )
{
    LONG mymutex = 0;  // zero means free
    int shared = 0;
    HANDLE threads[NUM_THREADS];
    struct homebrew test[NUM_THREADS];
    int i;

    // Initialize each thread's structure. All share the same mutex and a shared
    // counter
    for ( i = 0; i < NUM_THREADS; i++ ) {
        test[i].mine = 0; test[i].shared = &shared; test[i].mutex = &mymutex;
    }

    // create the threads and then wait for all to finish
    for ( i = 0; i < NUM_THREADS; i++ )
        threads[i] = CreateThread(NULL, 0, SomeThread, &test[i], 0, NULL);
    for ( i = 0; i < NUM_THREADS; i++ )
        WaitForSingleObject( threads[i], INFINITE );

    // Verify all increments occurred atomically
    printf( "shared = %d (%s)\n", shared,
            shared == NUM_THREADS * NUM_ACQUIRES ? "correct" : "wrong" );
    for ( i = 0; i < NUM_THREADS; i++ ) {
        if ( test[i].mine != NUM_ACQUIRES ) {
            printf( "Thread %d cheated. Only %d acquires.\n", i, test[i].mine );
        }
    }
    return 0;
}
If I comment out the call to InterlockedCompareExchange and just let all threads run the increments in a free-for-all fashion, then the runs do produce failures. Running it 10 times, for example, without the interlocked compare call:
shared = 748694 (wrong)
shared = 811522 (wrong)
shared = 796155 (wrong)
shared = 825947 (wrong)
shared = 1000000 (correct)
shared = 795036 (wrong)
shared = 801810 (wrong)
shared = 790812 (wrong)
shared = 724753 (wrong)
shared = 849444 (wrong)
The curious thing is that one time the results showed no corruption. That might be because there is no "everyone start now" synchronization; maybe all threads started and finished in order in that case. But when I have the InterlockedCompareExchange call in place, it runs without failure (or at least it ran 100 times without failure ... that doesn't prove I didn't write a subtle bug into the example).
Here is the discussion from the people who implemented it ... very interesting, as it shows the tradeoffs.
Several posts from Linus T. ... of course.
In earlier, pre-POSIX days, I used to implement synchronization using a native machine word (e.g. a 16- or 32-bit word) and the Test And Set instruction lurking on every serious processor. This instruction guarantees to test the value of a word and set it in one atomic operation. This provides the basis for a spinlock, and from that a hierarchy of synchronization functions can be built. The simplest is of course just a spinlock that performs a busy wait, not an option for more than transitory syncing; next is a spinlock that drops the process's time slice at each iteration, for lower system impact. Notional concepts like semaphores, mutexes, monitors etc. can be built by getting into the kernel scheduling code.
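A minimal sketch of that idea in modern terms, with C11 atomics standing in for the raw Test And Set instruction (the type and function names here are made up for illustration):

#include <stdatomic.h>
#include <sched.h>

typedef struct {
    atomic_flag flag;  // clear = free, set = held
} spinlock_t;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

void spin_lock(spinlock_t *l)
{
    // test-and-set as one atomic operation; loop until the flag was clear
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
        sched_yield();  // the "drop the time slice each iteration" variant
}

void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}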
As I recall, the prime usage was to implement message queues to permit multiple clients to access a database server. Another was a very early real-time car race result and timing system on a quite primitive 16-bit machine and OS.
These days I use pthreads and semaphores and Windows events/mutexes (mutices?) etc. and don't give a thought as to how they work, although I must admit that having been down in the engine room gives one an intuitive feel for better and more efficient multiprocessing.
In the Windows world:
Before Windows Vista, the mutex was implemented with a compare-exchange to change the state of the mutex from Empty to BeingUsed; for the other threads that then waited on the mutex, the CAS would obviously fail, and they had to be added to the mutex's queue for further notification. Those queue operations (add/remove/check) were protected by a common lock in the Windows kernel.
After Windows XP, the mutex started to use a spin lock for performance reasons, becoming a self-sufficient object.
In the Unix world I didn't get much further, but it is probably very similar to Windows 7.
Finally, for kernels that run on a single processor, the best way is to disable interrupts when entering the critical section and re-enable them when exiting.
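A sketch of that last approach using the Linux kernel's interrupt-control macros (kernel-side code; the protected_increment() helper is made up for illustration, and disabling interrupts only suffices as a lock on a uniprocessor):

#include <linux/irqflags.h>

static int protected_increment(int *counter)
{
    unsigned long flags;
    int result;

    local_irq_save(flags);     /* disable interrupts, remembering prior state */
    result = ++*counter;       /* critical section */
    local_irq_restore(flags);  /* restore the previous interrupt state */
    return result;
}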

Strange behavior of printk in linux kernel module

I am writing a code for linux kernel module and experiencing a strange behavior in it.
Here is my code:
#include <linux/module.h>
#include <linux/kthread.h>

int data = 0;

int threadfn1(void *arg)
{
    int j;
    for( j = 0; j < 10; j++ )
        printk(KERN_INFO "I AM THREAD 1 %d\n",j);
    data++;
    return 0;
}

int threadfn2(void *arg)
{
    int j;
    for( j = 0; j < 10; j++ )
        printk(KERN_INFO "I AM THREAD 2 %d\n",j);
    data++;
    return 0;
}

static int __init abc_init(void)
{
    struct task_struct *t1 = kthread_run(threadfn1, NULL, "thread1");
    struct task_struct *t2 = kthread_run(threadfn2, NULL, "thread2");

    while( 1 )
    {
        printk("debug\n"); // runs ok
        if( data >= 2 )
        {
            kthread_stop(t1);
            kthread_stop(t2);
            break;
        }
    }
    printk(KERN_INFO "HELLO WORLD\n");
    return 0;
}
Basically I was trying to wait for the threads to finish and then print something after that.
The above code does achieve that target, but only with printk("debug\n"); not commented out. As soon as I comment out printk("debug\n"); to run the code without debugging and load the module through the insmod command, the module hangs, and it seems like it gets lost in an infinite loop. I don't know why printk affects my code in such a big way.
Any help would be appreciated.
Regards.
You're not synchronizing access to the data variable. What happens is that the compiler will generate an infinite loop. Here is why:
while( 1 )
{
    if( data >= 2 )
    {
        kthread_stop(t1);
        kthread_stop(t2);
        break;
    }
}
The compiler can detect that the value of data never changes within the while loop. Therefore it can move the check out of the loop entirely, and you'll end up with a simple
while (1) {}
If you insert printk, the compiler has to assume that the global variable data may change (after all, the compiler has no idea what printk does in detail), so your code will start to work again (in an undefined-behavior kind of way...).
How to fix this:
Use proper thread synchronization primitives. If you wrap the access to data in a code section protected by a mutex, the code will work. You could also replace the data variable with a counted semaphore.
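A minimal sketch of the mutex fix, reusing the data counter from the question (the mark_done()/all_done() helper names are made up for illustration):

#include <linux/mutex.h>

static DEFINE_MUTEX(data_mutex);
static int data = 0;

/* each thread calls this instead of the bare data++ */
static void mark_done(void)
{
    mutex_lock(&data_mutex);
    data++;
    mutex_unlock(&data_mutex);
}

/* abc_init()'s loop calls this instead of reading data directly */
static int all_done(void)
{
    int done;

    mutex_lock(&data_mutex);
    done = (data >= 2);
    mutex_unlock(&data_mutex);
    return done;
}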
Edit:
This link explains how locking in the linux-kernel works:
http://www.linuxgrill.com/anonymous/fire/netfilter/kernel-hacking-HOWTO-5.html
With the call to printk() removed, the compiler optimises the loop into while (1);. When you add the call to printk(), the compiler is not sure that data isn't changed, and so checks the value each time through the loop.
You can insert a barrier into the loop, which forces the compiler to re-evaluate data on each iteration, e.g.:
while (1) {
    if (data >= 2) {
        kthread_stop(t1);
        kthread_stop(t2);
        break;
    }
    barrier();
}
Maybe data should be declared volatile? It could be that the compiler is not going to memory to get data in the loop.
Nils Pipenbrinck's answer is spot on. I'll just add some pointers.
Rusty's Unreliable Guide to Kernel Locking (every kernel hacker should read this one).
Goodbye semaphores?, The mutex API (lwn.net articles on the new mutex API introduced in early 2006; before that, the Linux kernel used semaphores as mutexes).
Also, since your shared data is a simple counter, you can just use the atomic API (basically, declare your counter as atomic_t and access it using atomic_* functions).
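A sketch of that variant (same made-up helper names as the mutex sketch above):

#include <linux/atomic.h>

static atomic_t data = ATOMIC_INIT(0);

/* each thread calls this instead of the bare data++ */
static void mark_done(void)
{
    atomic_inc(&data);
}

/* the waiting loop calls this instead of reading data directly */
static int all_done(void)
{
    return atomic_read(&data) >= 2;
}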
Volatile might not always be a "bad idea". One needs to separate the case where volatile is needed from the case where a mutual-exclusion mechanism is needed; it is non-optimal when one uses or misuses one mechanism for the other. In the above case I would suggest that, for the optimal solution, both mechanisms are needed: the mutex to provide mutual exclusion, and volatile to indicate to the compiler that the information must be read fresh from hardware. Otherwise, at higher optimization levels (-O2, -O3), the compiler might inadvertently leave out the needed code.

Linux synchronization with FIFO waiting queue

Are there locks in Linux where the waiting queue is FIFO? This seems like such an obvious thing, and yet I just discovered that pthread mutexes aren't FIFO, and semaphores apparently aren't FIFO either (I'm working on kernel 2.4 (homework))...
Does Linux have a lock with FIFO waiting queue, or is there an easy way to make one with existing mechanisms?
Here is a way to create a simple queueing "ticket lock", built on pthreads primitives. It should give you some ideas:
#include <pthread.h>

typedef struct ticket_lock {
    pthread_cond_t cond;
    pthread_mutex_t mutex;
    unsigned long queue_head, queue_tail;
} ticket_lock_t;

#define TICKET_LOCK_INITIALIZER { PTHREAD_COND_INITIALIZER, PTHREAD_MUTEX_INITIALIZER }

void ticket_lock(ticket_lock_t *ticket)
{
    unsigned long queue_me;

    pthread_mutex_lock(&ticket->mutex);
    queue_me = ticket->queue_tail++;
    while (queue_me != ticket->queue_head)
    {
        pthread_cond_wait(&ticket->cond, &ticket->mutex);
    }
    pthread_mutex_unlock(&ticket->mutex);
}

void ticket_unlock(ticket_lock_t *ticket)
{
    pthread_mutex_lock(&ticket->mutex);
    ticket->queue_head++;
    pthread_cond_broadcast(&ticket->cond);
    pthread_mutex_unlock(&ticket->mutex);
}
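Usage is then symmetric with a plain mutex; for example, with a hypothetical thread function:

static ticket_lock_t lock = TICKET_LOCK_INITIALIZER;

static void *worker(void *arg)
{
    ticket_lock(&lock);
    /* critical section: threads are admitted in the order in
       which they called ticket_lock() */
    ticket_unlock(&lock);
    return NULL;
}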
If you are asking what I think you are asking, the short answer is no. Threads/processes are controlled by the OS scheduler. One random thread is going to get the lock; the others aren't. Well, potentially more than one if you are using a counting semaphore, but that's probably not what you are asking.
You might want to look at pthread_setschedparam, but it's not going to get you where I suspect you want to go.
You could probably write something, but I suspect it would end up being inefficient and defeat the purpose of using threads in the first place, since you would just end up randomly yielding each thread until the one you want gets control.
Chances are good you are just thinking about the problem in the wrong way. You might want to describe your goal and get better suggestions.
I had a similar requirement recently, except dealing with multiple processes. Here's what I found:
If you need 100% correct FIFO ordering, go with caf's pthread ticket lock.
If you're happy with 99% and favor simplicity, a semaphore or a mutex can actually do really well.
The ticket lock can be made to work across processes: you need to use shared memory, a process-shared mutex and condition variable, and handle processes dying with the mutex locked (-> robust mutex)... which is a bit overkill here; all I needed was that the different instances not get scheduled at the same time and that the order be mostly fair.
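For completeness, a sketch of what the process-shared initialization would look like, reusing caf's ticket_lock_t (the init function name is made up; robust-mutex handling of dying owners is omitted):

#include <pthread.h>

void ticket_lock_init_shared(ticket_lock_t *lock)
{
    pthread_mutexattr_t ma;
    pthread_condattr_t ca;

    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);

    pthread_mutex_init(&lock->mutex, &ma);
    pthread_cond_init(&lock->cond, &ca);
    lock->queue_head = lock->queue_tail = 0;

    pthread_mutexattr_destroy(&ma);
    pthread_condattr_destroy(&ca);
}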
Using a semaphore:
#include <errno.h>
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* fail() is assumed to be an error-reporting helper that exits */

static sem_t *sem = NULL;

void fifo_init()
{
    sem = sem_open("/server_fifo", O_CREAT, 0600, 1);
    if (sem == SEM_FAILED) fail("sem_open");
}

void fifo_lock()
{
    int r;
    struct timespec ts;

    if (clock_gettime(CLOCK_REALTIME, &ts) == -1) fail("clock_gettime");
    ts.tv_sec += 5;  /* 5s timeout */

    while ((r = sem_timedwait(sem, &ts)) == -1 && errno == EINTR)
        continue;  /* Restart if interrupted */
    if (r == 0) return;

    if (errno == ETIMEDOUT) fprintf(stderr, "timeout ...\n");
    else fail("sem_timedwait");
}

void fifo_unlock()
{
    /* If we somehow end up with more than one token, don't increment the semaphore... */
    int val;
    if (sem_getvalue(sem, &val) == 0 && val <= 0)
        if (sem_post(sem)) fail("sem_post");
    usleep(1);  /* Yield to other processes */
}
Ordering is almost 100% FIFO.
Note: This is with a 4.4 Linux kernel, 2.4 might be different.
