posix: interprocess lock abandoned, is there a better way? - multithreading

I'm coding on AIX, but looking for a general 'nix solution, posix compliant ideally. Can't use anything in C++11 or later.
I have shared memory with many threads from many processes involved. The data in shared memory has to stay self-consistent, so I need a lock, to get everyone to take turns.
Processes crashing with the lock is a thing, so I have to be able to detect an abandoned lock, fix (aka reset) the data, and move on. Twist: deciding the lock is abandoned by waiting for it for some fixed period is not a viable solution.
A global mutex (either living in shared memory, or named) appears not to be a solution. There's no detection mechanism for abandonment (except timing) and even then you can't delete and reform the mutex without risking undefined behaviour.
So I opted for lockf() and a busy flag - get the file lock, set the flag in shared memory, do stuff, unset the flag, drop the lock. On a crash with the lock owned, the lock is automatically dropped, and the next guy to get it can see the busy flag is still set, and knows he has to clean up a mess.
This doesn't work - because lockf() will keep threads from other processes out, but it has special semantics for other threads in your own process. It lets them through unchecked.
In the end I came up with a two step solution - a local (thread) mutex and a file lock. Get the local mutex first; now you're the only thread in this process doing the next step, which is lockf(). lockf() in turn guarantees you're the only process getting through, so now you can set the busy flag and do the work. To unlock, go in reverse order: clear the busy flag, drop the file lock, drop the mutex lock. In a crash, the local mutex vanishes when the process does, so it's harmless.
Works fine. I hate it. Using two locks nested like this strikes me as expensive, and takes a page worth of comments in the code to explain. (My next code review will be interesting). I feel like I missed a better solution. What is it?
Edit: #Matt I probably wasn't clear. The busy flag isn't part of the locking mechanism; it's there to indicate when some process successfully acquired the lock(s). If, after acquiring the locks, you see the busy flag is already set, it means some other process got the locks and then crashed, leaving the shared memory it was in the middle of writing to in an incomplete state. In that case the thread now in possess of the lock gets the job of re-initializing the shared memory to a usable state. I probably should have called it a "memoryBeingModified" flag.
No variation of "tryLock" is going to be permissible. Polling is absolutely out of the question in this application. Threads that need to modify shared memory may only block on the locks (which are never held long) and have to take their turn as soon as the lock is available to them. They have to experience the minimum possible delay.

You can just
//always returns true unless something horrible happened
bool lock()
{
if (pthread_mutex_lock(&local_mutex)==0)
{
if (lockf(global_fd, F_LOCK, 0))
return true;
pthread_mutex_unlock(&local_mutex);
}
return false;
}
void unlock()
{
lockf(global_fd, F_ULOCK, 0);
pthread_mutex_unlock(&local_mutex);
}
This seems pretty straightforward to me, and I wouldn't feel too bad about using 2 levels of lock -- the pthread_mutex is quite fast and consumes almost no resources.

The simple answer is, there's no good solution. On AIX, lockf turns out to be extremely slow, for no good reason. But mutexes in shared memory, while very fast on any platform, are fragile (anyone can crash while holding the lock and there's no recovery for that.) It would be nice is posix defined a "this mutex is held by a thread/process that died ", but it doesn't and even if there was such an error code, there's no way to repair things and continue. Using shared memory with multiple readers and writers continues to be the wild west.

Related

Make thread wait for condition but allow thread to remain usable while waiting or listening for a signal

Given a situation where thread A had to dispatch work to thread B, is there any synchronisation mechanism that allows thread A to not return, but remain usable for other tasks, until thread B is done, of which then thread A can return?
This is not language specific, but simple c language would be a great choice in responding to this.
This could be absolutely counterintuitive; it actually sounds as such, but I have to ask before presuming...
Please Note This is a made up hypothetical situation that I'm interested in. I am not looking for a solution to an existing problem, so alternative concurrency solutions are completely pointless. I have no code for it, and if I were in it I can think of a few alternative code engineering solutions to avoid this setup. I just wish to know if a thread can be usable, in some way, while waiting for a signal from another thread, and what synchronisation mechanism to use for that.
UPDATE
As I mentioned above, I know how to synchronise threads etc. Im only interested in the situation that I have presented here. Mutexes, semaphores and locks all kinds of mechanisms will all synchronise access to resources, synchronise order of events, synchronise all kinds of concurrently issues, yes. But Im not interested in how to do it properly. I just have this made up situation that I wish to know if it can be addressed with a mechanism as described prior.
UPDATE 2
It seems I have opened up a portal for people that think they are experts in concurrency to teleport and lecture at chance how they think the rest of world does not know how threading works. I simply asked if there is a mechanism for this situation, not a work around solution, not 'the proper way to synchronise', not a better way to do it. I already know what I would do and never be in this made up situation. It's simply hypothetical.
After much research, thought, and overview, I have come to the conclusion that its like asking:
If a calculator has the ability for me simply enter a series of 5 digits and automatically get their sum on the screen.
No, it does not have such a mode ready. But I can still get the sum with a few extra clicks using the plus and eventually the equal button.
If i really wanted a thread that can continue while listening for a condition of some sort, I could easily implement a personal class or object around the OS/kernel/SDK thread or whatever and make use of that.
• So at a low level, my answer is no, there is no such mechanism •
If a thread is waiting, then it's waiting. If it can continue executing then it is not really 'waiting', in the concurrency meaning of waiting. Otherwise there would be some other term for this state (Alert Waiting, anyone?). This is not to say it is not possible, just not with one simple low level predefined mechanism similar to a mutex or semaphore etc. One could wrap the required functionality in some class or object etc.
Having said that, there are Interrupts and Interrupt handlers, which come close to addressing this situation. However, an interrupt has to be defined, with its handler. The interrupts may actually be running on another thread (not to say a thread per interrupt). So a number of objects are involved here.
You have a misunderstanding about how mutexes are typically used.
If you want to do some work, you acquire the mutex to figure out what work you need to do. You do this because "what work you need to do" is shared between the thread that decide what work needed to be done and the thread that's going to do the work. But then you release the mutex that protects "what work you need to do" while you do the work.
Then, when you finish the work, you acquire the mutex that protects your report that the work is done. This is needed because the status of the work is shared with other threads. You set that status to "done" and then you release the mutex.
Notice that no thread holds the mutex for very long, just for the microscopic fraction of a second it needs to check on or modify shared state. So to see if work is done, you can acquire the mutex that protects the reporting of the status of that work, check the status, and then release the mutex. The thread doing the work will not hold that mutex for longer than the tiny fraction of a second it needs to change that status.
If you're holding mutexes so long that you worry at all about waiting for them to be released, you're either doing something wrong or using mutexes in a very atypical way.
So use a mutex to protect the status of the work. If you need to wait for work to be done, also use a condition variable. Only hold that mutex while changing, or checking, the status of the work.
But, If a thread attempts to acquire an already acquired mutex, that thread will be forced to wait until the thread that originally acquired the mutex releases it. So, while that thread is waiting, can it actually be usable. This is where my question is.
If you consider any case where one thread might slow another thread down to be "waiting", then you can never avoid waiting. All that has to happen is one thread accesses memory and that might slow another thread down. So what do you do, never access memory?
When we talk about one thread "waiting" for another, what we mean is waiting for the thread to do actual work. We don't worry about the microscopic overhead of inter-thread synchronization both because there's nothing we can do about it and because it's negligible.
If you literally want to find some way that one thread can never, ever slow another thread down, you'll have to re-design pretty much everything we use threads for.
Update:
For example, consider some code that has a mutex and a boolean. The boolean indicates whether or not the work is done. The "assign work" flow looks like this:
Create a work object with a mutex and a boolean. Set the boolean to false.
Dispatch a thread to work on that object.
The "do work" flow looks like this:
Do work. (The mutex is not held here.)
Acquire mutex.
Set boolean to true.
Release mutex.
The "is work done" flow looks like this:
Acquire mutex.
Copy boolean.
Release mutex.
Look at copied value.
This allows one thread to do work and another thread to check if the work is done any time it wants to while doing other things. The only case where one thread waits for the other is the one-in-a-million case where a thread that needs to check if the work is done happens to check right at the instant that the work has just finished. Even in that case, it will typically block for less than a microsecond as the thread that holds the mutex only needs to set one boolean and release the mutex. And if even that bothers you, most mutexes have a non-blocking "try to lock" function (which you would use in the "check if work is done" flow so that the checking thread never blocks).
And this is the normal way mutexes are used. Actual contention is the exception, not the rule.

Ways to detect deadlock in a live application

What are the ways to detect deadlocks in a live multi-threaded application?
If we found there is a deadlock, are there any ways to resolve it, without taking down/restarting the application?
There are two popular ways to detect deadlocks.
One is to have threads set checkpoints. For example, if you have a thread that has a work loop, you set a timer at the beginning of doing work that's set for longer than you think the work could possibly take. If the timer fires, you assume the thread is deadlocked. When the work is done, you cancel the timer.
Another (sometimes used in combination) is to have things that a thread might block on track what other resources a thread might hold. This can directly detect an attempt to acquire one lock while holding another one when other threads have acquired those locks in the opposite order.
This can even detect deadlock risk without the deadlock actually occurring. If one thread acquires lock A then B and another acquires lock B then A, there is no deadlock unless they overlap. But this method can detect it.
Advanced deadlock detection is typically only used during debugging. Other than coding the application to check each blocking lock for a possible deadlock and knowing what to do if it happens, the only thing you can do after a deadlock is tear the application down. You can't release locks blindly because the resources they protect may be in an inconsistent state.
Sometimes you deliberately write code that you know can deadlock and specifically code it to avoid the problem. For example, if you know lots of threads take lock A and then try to acquire lock B, and some other thread needs to do the reverse, you can code it do a non-blocking attempt to lock B and release lock A if it fails.
Typically, it's more useful to spend your effort making deadlocks impossible rather than making the code detect and work around deadlocks.
Python has a feature called the faulthandler that's very useful for dealing with deadlocks:
import faulthandler
faulthandler.register(signal.SIGUSR1)
If you're using C++ or any compiler that uses glibc, you can use the backtrace() functions in execinfo.h to print a stacktrace and exit gracefully when you get a signal. You can take a deadlocked program, send it a signal and get a list of all the threads.
In Java, use jstack <pid> on the stuck process.

Why are thread locks resources?

I recently read that thread locks are system resources, therefore they have to be properly released "just like memory". I realised I wasn't really aware of this.
Can someone offer some further elaboration on this fact or point to a good reference? More specifically: how can I think about the implementation of locks at a deeper system level? What are the possible consequences of leaking locks? Is there a maximum number of locks available in the system?
All that means is that you have to be careful that anything you lock gets released, similar to how you would be careful to close network connections or files or graphic device contexts or whatever. If you write code that is not careful about that, then you risk having the program deadlock or be unable to progress when it can't get access to something that's locked (because the point of locking is to make sure multiple threads can access something safely, so if one thread leaves something locked other threads that need to access it are shut out).
The program will have severe performance issues a long time before it runs out of physical locks, so typically you shouldn't have to worry about the number of available locks.

Can a thread be context-switched while holding a lock?

How do you take care of context-switching when a thread holds a lock and hence blocks other threads? I would expect it's a very common issue.
On a pre-emptive multi-tasking system, you can't prevent yourself from being switched out while holding a lock. But since anything else that's waiting on the lock (assuming that it's not a spinlock) can't be switched in, this is normally not a problem.
Using a spinlock is almost always a bad idea. There are legitimate cases where things can go badly if you hold a lock too long; you can manage this by ensuring that you hold the lock for the least amount of time possible and that you don't do anything that can block while holding it.
The blocked threads don't run, and eventually the thread holding the lock becomes active again, eventually releasing the lock. You don't have to do anything special beyond the usual nitpicky care needed with multithreaded code.

Is a lock (threading) atomic?

This may sound like a stupid question, but if one locks a resource in a multi-threaded app, then the operation that happens on the resource, is that done atomically?
I.E.: can the processor be interrupted or can a context switch occur while that resource has a lock on it? If it does, then nothing else can access this resource until it's scheduled back in to finish off it's process. Sounds like an expensive operation.
The processor can very definitely still switch to another thread, yes. Indeed, in most modern computers there can be multiple threads running simultaneously anyway. The locking just makes sure that no other thread can acquire the same lock, so you can make sure that an operation on that resource is atomic in terms of that resource. Code using other resources can operate completely independently.
You should usually lock for short operations wherever possible. You can also choose the granularity of locks... for example, if you have two independent variables in a shared object, you could use two separate locks to protect access to those variables. That will potentially provide better concurrency - but at the same time, more locks means more complexity and more potential for deadlock. There's always a balancing act when it comes to concurrency.
You're exactly right. That's one reason why it's so important to lock for short period of time. However, this isn't as bad as it sounds because no other thread that's waiting on the lock will get scheduled until the thread holding the lock releases it.
Yes, a context switch can definitely occur.
This is exactly why when accessing a shared resource it is important to lock it from another thread as well. When thread A has the lock, thread B cannot access the code locked.
For example if two threads run the following code:
1. lock(l);
2. -- change shared resource S here --
3. unlock(l);
A context switch can occur after step 1, but the other thread cannot hold the lock at that time, and therefore, cannot change the shared resource. If access to the shared resource on one of the threads is done without a lock - bad things can happen!
Regarding the wastefulness, yes, it is a wasteful method. This is why there are methods that try to avoid locks altogether. These methods are called lock-free, and some of them are based on strong locking services such as CAS (Compare-And-Swap) or others.
No, it's not really expensive. There are typically only two possibilities:
1) The system has other things it can do: In this case, the system is still doing useful work with all available cores.
2) The system doesn't have anything else to do: In this case, the thread that holds the lock will be scheduled. A sane system won't leave a core unused while there's a ready-to-run thread that's not scheduled.
So, how can it be expensive? If there's nothing else for the system to do that doesn't require acquiring that lock (or not enough other things to occupy all cores) and the thread holding the lock is not ready-to-run. So that's the case you have to avoid, and the context switch or pre-empt issue doesn't matter (since the thread would be ready-to-run).

Resources