Is work_queue thread safe? - linux

Looking at workqueue.c, it appears as though the only parts that are properly locked are those between the publicly exposed APIs and the internal thread that runs. There seem to be some things outside the critical section which (to my untrained eye) could be unsafe.
Am I correct or way off base?

I presume you are talking about workqueue.c in the Linux kernel?
http://lxr.linux.no/#linux+v3.2.9/kernel/workqueue.c
In that case, only use the public API, assume it is threadsafe and report any issues you see to Ingo Molnar. Note that most kernel developers are seriously smart and will not repeat the "big lock" mistake ever again: not everything runs under a mutex, because not everything needs to. Tricks like RCU (read-copy-update) also reduce the amount of locking needed.
And everything without a lock will perform a lot better.

Related

Multithreading analysis techniques

Does anyone know of any analysis techniques that can be used to design/debug thread locking and unlocking sequences? Essentially a technique (like a truth table) I can use to prove that my sequence of locks won't deadlock.
This is not the sort of problem where programming by trial and error works well.
My particular problem is a read write lock - but I ask this in the general sense. I believe it would be a useful technique to learn if one exists.
I have tried a causal graph in which I have boxes and arrows that I can use to follow the flow of control, and that has solved 80% of my problem. But I am still getting occasional deadlocks under stress testing when one thread sneaks through the "gap between instructions", if that makes any sense.
To summarize; what I need is some way of representing the problem so that I can formally analyze the overlap of mutex locks.
Bad news I'm afraid. There are no techniques that I know of that can "prove" correct a system that uses locks to control access to shared memory. By "prove" I mean that you cannot demonstrate analytically that a program won't deadlock, livelock, etc.
The problem is that threads run asynchronously. As soon as you start having a sensible number of threads and shared resources, the number of possible sequences of events (e.g. locking/unlocking shared resources) is astronomically high and you cannot model / analyse each and every one of them.
For this reason Communicating Sequential Processes was developed by Tony Hoare, way back in 1978. It is a development of the Actor model which itself goes a long way to resolving the problem.
Actor and CSP
Briefly, in the Actor model data is not communicated via shared memory with a lock. Instead a copy is sent down a communications channel of some sort (e.g. a socket, or pipe) between two threads. This means that you're never locking memory. In effect all memory is private to threads, with copies of it being sent as and when required to other threads. It's a very 'object orientated' thing; private data (thread-owned memory), public interface (messages emitted and received on communications channels). It's also very scalable - pipes can become sockets, threads can become processes on other computers.
The CSP model is just like that, except that the communications channel won't accept a message unless the receiving end is ready to read it.
This addition is crucial - it means that a system design can be analysed algebraically. Indeed Tony Hoare formulated a process calculus for CSP. The Wikipedia page on CSP cites the use of this to prove an eCommerce system's design.
So if one is developing a strict CSP system, it is possible to prove analytically that it cannot deadlock, etc.
Real World Experience
I've done many a CSP (or CSP-ish) system, and it's always been good. Instead of doing the maths I've used intuition to help me avoid problems. In effect CSP ensures that if I've gone and built a system that can deadlock, it will deadlock every time. So at least I find it in development, not 2 years later when some network link gets a bit busier than normal.
Real World Options
For Actor model programming there are a lot of options: ZeroMQ, nanomsg, Microsoft's .NET Dataflow library.
They're all pretty good, and with care you can build a solid system. I like ZeroMQ and nanomsg a lot - they make it trivial to split a bunch of threads up into separate processes on separate computers without changing the architecture at all. If absolute performance isn't essential, coupling these up with, for example, Google Protocol Buffers makes for a really tidy system with huge options for incorporating different OSes, languages and systems into your design.
I suspect that MS's Dataflow library for .NET moves ownership of references to the data around instead of copying it. That ought to make it pretty performant (though I've not actually tried it to see).
CSP is a bit harder to come by. You can nearly turn ZeroMQ and Dataflow into CSP by setting message buffer lengths. Unfortunately you cannot set the buffer length to zero (which is what would make it CSP). MS's documentation even talks about the benefits to system robustness achieved by setting the queue length to 1.
You can synthesize CSP on top of Actor by having flows of synchronisation messages across the links. This is annoying to have to implement.
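For illustration, here is a minimal sketch (my own, not taken from any of the libraries above) of such a synthesized CSP channel in C with pthreads: a zero-buffered channel where the sender blocks until a receiver has actually taken the value, which is the rendezvous behaviour that distinguishes CSP from plain Actor-style message passing.

    #include <pthread.h>

    /* A zero-capacity (rendezvous) channel: csp_send() does not
     * return until a receiver has taken the value. Sketch only -
     * one value at a time, no shutdown handling. */
    typedef struct {
        pthread_mutex_t mu;
        pthread_cond_t  cv;
        void *value;
        int   full;   /* a value is waiting to be received */
    } csp_chan;

    void csp_chan_init(csp_chan *c) {
        pthread_mutex_init(&c->mu, NULL);
        pthread_cond_init(&c->cv, NULL);
        c->value = NULL;
        c->full  = 0;
    }

    void csp_send(csp_chan *c, void *v) {
        pthread_mutex_lock(&c->mu);
        while (c->full)                     /* wait for the slot to be free */
            pthread_cond_wait(&c->cv, &c->mu);
        c->value = v;
        c->full  = 1;
        pthread_cond_broadcast(&c->cv);     /* wake a receiver */
        while (c->full)                     /* rendezvous: wait until taken */
            pthread_cond_wait(&c->cv, &c->mu);
        pthread_mutex_unlock(&c->mu);
    }

    void *csp_recv(csp_chan *c) {
        pthread_mutex_lock(&c->mu);
        while (!c->full)
            pthread_cond_wait(&c->cv, &c->mu);
        void *v = c->value;
        c->full = 0;
        pthread_cond_broadcast(&c->cv);     /* release the blocked sender */
        pthread_mutex_unlock(&c->mu);
        return v;
    }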
I've quite often spun up my own comms framework to get a CSP environment.
There are CSP libraries for Java, I think, though I don't know how actively developed they are.
However as you have existing code written around locked shared memory it'll be a tough job to adapt your code. So....
Kernel Shark
If you're on Linux and your kernel has FTRACE compiled in you can use Kernel Shark to see what has happened in your system. Similarly with DTRACE on Solaris, WindView on VxWorks, TATL on MCOS.
What you do is run your system until it stops, and then very quickly preserve the FTRACE log (it gets overwritten in a circular buffer by the OS). You can then see graphically what has happened (turn on Kernel Shark's process view), which may give clues as to what did what and when.
This helps you diagnose your application's deadlock, which may lead you towards getting things right, but ultimately you can never prove that it is correct this way. That doesn't stop you having a Eureka moment where you now know in your bones that you've got it right.
I know of no equivalent of FTRACE / Kernel shark for Windows.
For a broad range of multithreading tasks, we can draw a graph which reflects the order in which resources are locked. If that graph has cycles, deadlock is quite possible. If there are no cycles, deadlock can never occur.
For example, consider the Dining Philosophers problem. If each philosopher takes the left fork first, and then the right fork, then the graph of locking order is a ring connecting all the forks. Deadlock is very possible in this situation. However, if one of the philosophers changes his order, the ring becomes a line and deadlock can never occur. If all philosophers change their order and all take the right fork first, the graph again forms a ring and deadlock is again possible.
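As a sketch in C (names are mine), the order-breaking fix looks like this: every philosopher locks the lower-numbered fork first, which imposes a global acquisition order and makes the lock graph acyclic.

    #include <pthread.h>

    #define N 5
    pthread_mutex_t fork_mu[N];   /* one mutex per fork */

    /* Always acquire the lower-numbered fork first. With a global
     * lock order the graph has no cycle, so deadlock cannot occur. */
    void pick_up_forks(int phil) {
        int left  = phil, right = (phil + 1) % N;
        int lo = left < right ? left : right;
        int hi = left < right ? right : left;
        pthread_mutex_lock(&fork_mu[lo]);
        pthread_mutex_lock(&fork_mu[hi]);
    }

    void put_down_forks(int phil) {
        pthread_mutex_unlock(&fork_mu[phil]);
        pthread_mutex_unlock(&fork_mu[(phil + 1) % N]);
    }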

How does the Linux kernel realize reentrancy?

All Unix kernels are reentrant: several processes may be executing in kernel mode at the same time. How is this achieved in code? How should the situation be handled where many processes invoke system calls and are pending in kernel mode?
[Edit - the term "reentrant" gets used in a couple of different senses. This answer uses the basic "multiple contexts can be executing the same code at the same time." This usually applies to a single routine, but can be extended to apply to a set of cooperating routines, generally routines which share data. An extreme case of this is when applied to a complete program - a web server, or an operating system. A web-server might be considered non-reentrant if it could only deal with one client at a time. (Ugh!) An operating system kernel might be called non-reentrant if only one process/thread/processor could be executing kernel code at a time.
Operating systems like that existed during the transition to multi-processor systems. Many went through a slow transition from written-for-uniprocessor, to one-single-lock-protects-everything (i.e. non-reentrant), then through various stages of finer and finer grained locking. IIRC, Linux finally got rid of the "big kernel lock" at approx. version 2.6.37 - but it was mostly gone long before that, just protecting remnants not yet converted to a multiprocessing implementation.
The rest of this answer is written in terms of individual routines, rather than complete programs.]
If you are in user space, you don't need to do anything. You call whatever system calls you want, and the right thing happens.
So I'm going to presume you are asking about code in the kernel.
Conceptually, it's fairly simple. It's also pretty much identical to what happens in a multi-threaded program in user space, when multiple threads call the same subroutine. (Let's assume it's a C program - other languages may have differently named mechanisms.)
When the system call implementation is using automatic (stack) variables, it has its own copy - no problem with re-entrancy. When it needs to use global data, it generally needs to use some kind of locking - the specific locking required depends on the specific data it's using, and what it's doing with that data.
This is all pretty generic, so perhaps an example might help.
Let's say the system call wants to modify some attribute of a process. The process is represented by a struct task_struct, which is a member of various linked lists. Those linked lists are protected by the tasklist_lock. Your system call takes the tasklist_lock, finds the right process, possibly takes a per-process lock controlling the field it cares about, modifies the field, and drops both locks.
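A rough sketch of that pattern (simplified, and the exact helpers and locks differ between kernel versions; target_pid stands in for whatever the system call received):

    /* Sketch only: simplified from real kernel code. */
    struct task_struct *p;

    read_lock(&tasklist_lock);        /* protects the task lists */
    for_each_process(p) {
        if (p->pid == target_pid) {
            task_lock(p);             /* per-process lock */
            /* ... modify the attribute we care about ... */
            task_unlock(p);
            break;
        }
    }
    read_unlock(&tasklist_lock);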
One more detail: consider processes executing different system calls that don't share data with each other. With a reasonable implementation, there are no conflicts at all. One process can get itself into the kernel to handle its system call without affecting the other processes. I don't remember looking specifically at the Linux implementation, but I imagine it's "reasonable": something like a trap into an exception handler, which looks in a table to find the subroutine to handle the specific system call requested. The table is effectively const, so no locks are required.

Runtime integrity check of executed files

I just finished writing a linux security module which verifies the integrity of executable files at the start of their execution (using digital signatures). Now I want to dig a little bit deeper and want to check the files' integrity during run-time (i.e. periodically check them - since I am mostly dealing with processes that get started and run forever...) so that an attacker is not able to change the file within main memory without being identified (at least after some time).
The problem here is that I have absolutely no clue how I can check the file's current memory image. My authentication method mentioned above makes use of an mmap-hook which gets called whenever a file is mmaped before its execution, but as far as I know the LSM framework does not provide tools for periodic checks.
So my question: are there any hints on how I should start this? How can I read a memory image and check its integrity?
Thank you
I understand what you're trying to do, but I'm really worried that this may be a security feature that gives you a warm fuzzy feeling for no good reason; and those are the most dangerous kinds of security features to have. (Another example of this might be the LSM sitting right next to yours, SELinux. Although I think I'm in the minority on this opinion...)
The program data of a process is not the only thing that affects its behavior. Stack overflows, where malicious code is written into the stack and jumped into, make integrity checking of the original program text moot. Not to mention the fact that an attacker can use the original unchanged program text to his advantage.
Also, there are probably some performance issues you'll run into if you are constantly computing DSA signatures inside the kernel. And you're adding that much more to the long list of privileged kernel code that could possibly be exploited later on.
In any case, to address the question: You can possibly write a kernel module that instantiates a kernel thread that, on a timer, hops through each process and checks its integrity. This can be done by using the page tables for each process, mapping in the read only pages, and integrity checking them. This may not work, though, as each memory page probably needs to have its own signature, unless you concatenate them all together somehow.
A good thing to note is that shared libraries only need to be integrity-checked once per sweep, since they are mapped into all the processes that use them. It takes sophistication to implement this, though, so maybe put it under the "nice-to-have" section of your design.
If you disagree with my rationale that this may not be a good idea, I'd be very interested in your thoughts. I ran into this idea at work a while ago, and it would be nice to bring fresh ideas to our discussion.

Synchronization of threads slows down a multithreaded application

I have a multithreaded application written in C#. What I noticed is that implementing thread synchronization with the lock(this) method slows down the application by 20%. Is that expected behavior, or should I look into the implementation more closely?
Locking does add some overhead; that can't be avoided. It is also very likely that some of your threads will now be waiting on resources to be released, rather than just grabbing them whenever they feel like it. If you implemented thread synchronization correctly, then that is a good thing.
But in general, your question can't be answered without intimate knowledge of the application. A 20% slowdown might be OK, but you might be locking too broadly, and then the program would (in general) be slower.
Also, please don't use lock(this). If your instance is passed around and someone else locks on the reference, you will have a deadlock. Best practice is to lock on a private object that no one else can access.
Depending on how coarse or granular your lock() statements are, you can indeed impact the performance of your MT app. Only lock things you really know are supposed to be locked.
Any synchronization will slow down multithreading.
That being said, lock(this) is really never a good idea. You should always lock on a private object used for nothing but synchronization when possible.
Make sure to keep your locking to a minimum, and only hold the lock for as short a time as possible. This will help keep the "slowdown" to a minimum.
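The question is C#-specific, but the same two rules translate to any language. As a hedged sketch in C with pthreads (names are mine), the analogue of locking on a private object is a mutex embedded in, and private to, the structure it protects:

    #include <pthread.h>

    /* The mutex lives inside the structure and is never exposed to
     * callers - the C analogue of locking on a private object
     * instead of on 'this'. */
    typedef struct {
        pthread_mutex_t mu;   /* private to this counter */
        long value;
    } counter;

    long counter_increment(counter *c) {
        pthread_mutex_lock(&c->mu);
        long v = ++c->value;          /* keep the critical section tiny */
        pthread_mutex_unlock(&c->mu);
        return v;                     /* use the copy outside the lock */
    }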
There are performance counters you can monitor in Windows to see how much time your application spends contending for locks.

How can I write a lock free structure?

In my multithreaded application I see heavy lock contention, preventing good scalability across multiple cores. I have decided to use lock-free programming to solve this.
How can I write a lock free structure?
Short answer is:
You cannot.
Long answer is:
If you are asking this question, you probably do not know enough to be able to create a lock-free structure. Creating lock-free structures is extremely hard, and only experts in this field can do it. Instead of writing your own, search for an existing implementation. When you find one, check how widely it is used, how well documented it is, whether it is well proven, and what its limitations are - even some lock-free structures other people have published are broken.
If you do not find a lock-free structure corresponding to the structure you are currently using, adapt the algorithm instead so that you can use an existing one.
If you still insist on creating your own lock free structure, be sure to:
start with something very simple
understand the memory model of your target platform (including read/write reordering constraints and which operations are atomic)
study a lot about the problems other people encountered when implementing lock-free structures
do not just guess whether it will work - prove it
heavily test the result
More reading:
Lock free and wait free algorithms at Wikipedia
Herb Sutter: Lock-Free Code: A False Sense of Security
Use a library such as Intel's Threading Building Blocks; it contains quite a few lock-free structures and algorithms. I really wouldn't recommend attempting to write lock-free code yourself, it's extremely error-prone and hard to get right.
Writing thread-safe lock free code is hard; but this article from Herb Sutter will get you started.
As sblundy pointed out, if all objects are immutable, read-only, you don't need to worry about locking. However, this means you may have to copy objects a lot. Copying usually involves malloc, and malloc uses locking to synchronize memory allocations across threads, so immutable objects may buy you less than you think (malloc itself scales rather badly and is slow; if you do a lot of malloc in a performance-critical section, don't expect good performance).
When you only need to update simple variables (e.g. 32 or 64 bit ints or pointers), perform simple addition or subtraction operations on them, or just swap the values of two variables, most platforms offer "atomic operations" for that (and GCC offers these as builtins as well). Atomic is not the same as thread-safe. However, atomic makes sure that if one thread writes a 64 bit value to a memory location, for example, and another thread reads from it, the reader either gets the value from before the write operation or from after it, but never a broken value in between (e.g. one where the first 32 bits are already the new value while the last 32 bits are still the old one - this can happen if you don't use atomic access on such a variable).
However, if you have a C struct with three values that you want to update, even if you update all three with atomic operations, these are three independent operations; a reader might thus see the struct with one value already updated and two not yet updated. Here you will need a lock if you must assure that the reader sees either all old values or all new values.
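A sketch of both cases using C11 atomics (names are mine): a single 64-bit value can be read and written atomically, but a three-field struct cannot, so it needs a lock if readers must see a consistent snapshot.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <pthread.h>

    /* Case 1: a single 64-bit value. Atomic load/store is enough;
     * readers never see a torn (half old, half new) value. */
    _Atomic int64_t shared_counter;

    void    write_simple(int64_t v) { atomic_store(&shared_counter, v); }
    int64_t read_simple(void)       { return atomic_load(&shared_counter); }

    /* Case 2: three related fields. Even if each field were atomic,
     * a reader could observe one updated and two stale. A lock makes
     * the whole update appear as a single step. */
    typedef struct {
        pthread_mutex_t mu;
        int x, y, z;
    } triple;

    void write_triple(triple *t, int x, int y, int z) {
        pthread_mutex_lock(&t->mu);
        t->x = x; t->y = y; t->z = z;   /* all-or-nothing for readers */
        pthread_mutex_unlock(&t->mu);
    }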
One way to make locks scale a lot better is using R/W locks. In many cases, updates to data are rather infrequent (write operations), but accessing the data is very frequent (reading the data), think of collections (hashtables, trees). In that case R/W locks will buy you a huge performance gain, as many threads can hold a read-lock at the same time (they won't block each other) and only if one thread wants a write lock, all other threads are blocked for the time the update is performed.
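A sketch of the R/W lock pattern with pthreads (do_lookup and do_insert are hypothetical placeholders for the actual collection code):

    #include <pthread.h>

    int  do_lookup(int key);             /* placeholder */
    void do_insert(int key, int value);  /* placeholder */

    pthread_rwlock_t table_lock = PTHREAD_RWLOCK_INITIALIZER;

    /* Frequent operation: many threads may hold the read lock at
     * the same time without blocking each other. */
    int table_lookup(int key) {
        pthread_rwlock_rdlock(&table_lock);
        int value = do_lookup(key);
        pthread_rwlock_unlock(&table_lock);
        return value;
    }

    /* Rare operation: the writer excludes all readers and writers
     * for the duration of the update. */
    void table_insert(int key, int value) {
        pthread_rwlock_wrlock(&table_lock);
        do_insert(key, value);
        pthread_rwlock_unlock(&table_lock);
    }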
The best way to avoid thread-issues is to not share any data across threads. If every thread deals most of the time with data no other thread has access to, you won't need locking for that data at all (also no atomic operations). So try to share as little data as possible between threads. Then you only need a fast way to move data between threads if you really have to (ITC, Inter Thread Communication). Depending on your operating system, platform and programming language (unfortunately you told us neither of these), various powerful methods for ITC might exist.
And finally, another trick to work with shared data, but without any locking, is to make sure threads don't access the same parts of the shared data. E.g. if two threads share an array, but one will only ever access even indexes and the other only odd ones, you need no locking. Or if both share the same memory block and one only uses the upper half of it and the other only the lower one, you need no locking. Though it's not a given that this will lead to good performance, especially not on multi-core CPUs: write operations of one thread to this shared data (running on one core) might force the cache line to be flushed for another thread (running on another core), and these cache flushes are often the bottleneck for multithreaded applications running on modern multi-core CPUs.
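A sketch of the index-partitioning idea, together with the cache caveat from the last sentence (the effect is usually called false sharing; padding per-thread data to a cache line avoids it - the 64-byte line size below is an assumption that fits most current x86 CPUs):

    #include <pthread.h>

    #define N 1024
    #define CACHE_LINE 64

    double data[N];

    /* Each thread touches a disjoint set of indexes, so the array
     * itself needs no locking. */
    void *worker_even(void *arg) {
        for (int i = 0; i < N; i += 2) data[i] *= 2.0;
        return NULL;
    }
    void *worker_odd(void *arg) {
        for (int i = 1; i < N; i += 2) data[i] *= 2.0;
        return NULL;
    }

    /* Caveat: interleaved even/odd elements still share cache
     * lines, so the cores ping-pong those lines between them
     * ("false sharing"). Per-thread results padded out to a full
     * cache line each avoid that: */
    struct padded_sum {
        double sum;
        char   pad[CACHE_LINE - sizeof(double)];
    };
    struct padded_sum partial[2];   /* one slot per thread */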
As my professor (Nir Shavit from "The Art of Multiprocessor Programming") told the class: please don't. The main reason is testability - you can't test synchronization code. You can run simulations, you can even stress test. But it's a rough approximation at best. What you really need is a mathematical correctness proof. And very few people are capable of understanding them, let alone writing them.
So, as others have said: use existing libraries. Joe Duffy's blog surveys some techniques (section 28). The first one you should try is tree-splitting - break into smaller tasks and combine.
Immutability is one approach to avoid locking. See Eric Lippert's discussion and implementation of things like immutable stacks and queues.
In re Suma's answer, Maurice Herlihy shows in The Art of Multiprocessor Programming that actually anything can be written without locks (see chapter 6). IIRC, this essentially involves splitting tasks into processing node elements (like a function closure) and enqueuing each one. Threads will calculate the state by following all nodes from the latest cached one. Obviously this could, in the worst case, result in sequential performance, but it does have important lockless properties, preventing scenarios where threads could get scheduled out for long periods of time while they are holding locks. Herlihy also achieves theoretical wait-free performance, meaning that one thread will not end up waiting forever to win the atomic enqueue (this is a lot of complicated code).
A multi-threaded queue / stack is surprisingly hard (check out the ABA problem). Other things may be very simple. Become accustomed to while(true) { atomicCAS until I swapped it } blocks; they are incredibly powerful. An intuition for what's correct with CAS can help development, though you should use good testing and maybe more powerful tools (maybe SKETCH, the upcoming MIT Kendo, or Spin?) to check correctness if you can reduce it to a simple structure.
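As a sketch of that while-true-CAS idiom with C11 atomics, here is a lock-free stack push (push on its own is the easy half; the matching pop is where the ABA problem mentioned above bites):

    #include <stdatomic.h>
    #include <stdlib.h>

    typedef struct node {
        int val;
        struct node *next;
    } node;

    _Atomic(node *) top = NULL;

    /* Retry until our compare-and-swap wins - the classic
     * "while(true) { atomicCAS until I swapped it }" block.
     * On failure, atomic_compare_exchange_weak reloads n->next
     * with the current top, so the loop body is empty. */
    void push(int v) {
        node *n = malloc(sizeof *n);
        n->val  = v;
        n->next = atomic_load(&top);
        while (!atomic_compare_exchange_weak(&top, &n->next, n))
            ;   /* lost the race: n->next now holds the new top, retry */
    }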
Please post more about your problem. It's difficult to give a good answer without details.
Edit: immutability is nice but its applicability is limited, if I'm understanding it right. It doesn't really overcome write-after-read hazards; consider two threads executing "mem = NewNode(mem)"; they could both read mem, then both write it - not correct for a classic increment function. Also, it's probably slow due to heap allocation (which has to be synchronized across threads).
Immutability would have this effect. Changes to the object result in a new object. Lisp works this way under the covers.
Item 13 of Effective Java explains this technique.
Cliff Click has done some major research on lock-free data structures by utilizing finite state machines, and has also posted a lot of implementations for Java. You can find his papers, slides and implementations at his blog: http://blogs.azulsystems.com/cliff/
Use an existing implementation, as this area of work is the realm of domain experts and PhDs (if you want it done right!)
For example there is a library of code here:
http://www.cl.cam.ac.uk/research/srg/netos/lock-free/
Most lock-free algorithms or structures start with some atomic operation, i.e. a change to some memory location that once begun by a thread will be completed before any other thread can perform that same operation. Do you have such an operation in your environment?
See here for the canonical paper on this subject.
Also try this Wikipedia article for further ideas and links.
The basic principle for lock-free synchronisation is this:
whenever you are reading the structure, you follow the read with a test to see if the structure was mutated since you started the read, and retry until you succeed in reading without something else coming along and mutating while you are doing so;
whenever you are mutating the structure, you arrange your algorithm and data so that there is a single atomic step which, if taken, causes the entire change to become visible to the other threads, and arrange things so that none of the change is visible unless that step is taken. You use whatever lockfree atomic mechanism exists on your platform for that step (e.g. compare-and-set, load-linked+store-conditional, etc.). In that step you must then check to see if any other thread has mutated the object since the mutation operation began, commit if it has not and start over if it has.
There are plenty of examples of lock-free structures on the web; without knowing more about what you are implementing and on what platform it is hard to be more specific.
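As one concrete instance of that mutation recipe (a sketch with C11 atomics, names mine): keep all shared state behind a single pointer, copy it, modify the private copy, and publish it with one compare-and-swap. If another thread committed first, the CAS fails and the whole operation starts over. Safely reclaiming the old copy is deliberately left out - that is where hazard pointers, RCU or garbage collection come in.

    #include <stdatomic.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct { int a, b, c; } state;

    _Atomic(state *) current;   /* all access goes through this pointer */

    void update(void (*mutate)(state *)) {
        for (;;) {
            state *old = atomic_load(&current);
            state *new = malloc(sizeof *new);
            memcpy(new, old, sizeof *new);   /* private copy */
            mutate(new);                     /* invisible to other threads */
            /* The single atomic step: publish iff nobody beat us to it. */
            if (atomic_compare_exchange_strong(&current, &old, new)) {
                /* Committed. 'old' is intentionally leaked here:
                 * freeing it while readers may still hold it is
                 * unsafe without hazard pointers, RCU or a GC. */
                return;
            }
            free(new);   /* lost the race: discard our copy, retry */
        }
    }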
If you are writing your own lock-free data structures for a multi-core cpu, do not forget about memory barriers! Also, consider looking into Software Transaction Memory techniques.
Well, it depends on the kind of structure, but you have to make the structure so that it carefully and silently detects and handles possible conflicts.
I doubt you can make one that is 100% lock-free, but again, it depends on what kind of structure you need to build.
You might also need to shard the structure so that multiple threads work on individual items, and then later on synchronize/recombine.
As mentioned, it really depends on what type of structure you're talking about. For instance, you can write a limited lock-free queue, but not one that allows random access.
Reduce or eliminate shared mutable state.
In Java, utilize the java.util.concurrent packages in JDK 5+ instead of writing your own. As was mentioned above, this is really a field for experts, and unless you have a spare year or two, rolling your own isn't an option.
Can you clarify what you mean by structure?
Right now, I am assuming you mean the overall architecture. You can accomplish it by not sharing memory between processes, and by using an actor model for your processes.
Take a look at my link ConcurrentLinkedHashMap for an example of how to write a lock-free data structure. It is not based on any academic papers and doesn't require years of research as others imply. It simply takes careful engineering.
My implementation does use a ConcurrentHashMap, which is a lock-per-bucket algorithm, but it does not rely on that implementation detail. It could easily be replaced with Cliff Click's lock-free implementation. An idea I borrowed from Cliff, but used much more explicitly, is to model all CAS operations with a state machine. This greatly simplifies the model, as you'll see that I have pseudo locks via the 'ing states. Another trick is to allow laziness and resolve as needed. You'll see this often with backtracking or letting other threads "help" with cleanup. In my case, I decided to allow dead nodes on the list to be evicted when they reach the head, rather than deal with the complexity of removing them from the middle of the list. I may change that, but I didn't entirely trust my backtracking algorithm and wanted to put off a major change like adopting a 3-node locking approach.
The book "The Art of Multiprocessor Programming" is a great primer. Overall, though, I'd recommend avoiding lock-free designs in the application code. Often times it is simply overkill where other, less error prone, techniques are more suitable.
If you see lock contention, I would first try to use more granular locks on your data structures rather than completely lock-free algorithms.
For example, I currently work on a multithreaded application that has a custom messaging system (a list of queues, one per thread; a queue contains messages for its thread to process) to pass information between threads. There is a global lock on this structure. In my case I don't need speed so much, so it doesn't really matter. But if this lock were to become a problem, it could be replaced by individual locks on each queue, for example. Then adding/removing an element to/from a specific queue wouldn't affect the other queues. There would still be a global lock for adding a new queue and such, but it wouldn't be so heavily contended.
Even a single multi-producer/consumer queue can be written with granular locking on each element, instead of having a global lock. This may also eliminate contention.
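A sketch of that finer-grained layout in C (names are mine): one mutex per queue for message traffic, and a separate global lock taken only when queues are added or removed.

    #include <pthread.h>

    typedef struct msg {
        struct msg *next;
        /* ... payload ... */
    } msg;

    typedef struct {
        pthread_mutex_t mu;   /* protects only this queue's contents */
        msg *head, *tail;
    } queue;

    typedef struct {
        pthread_mutex_t mu;   /* taken only to add/remove whole queues */
        queue *queues;
        int    nqueues;
    } msg_system;

    /* Hot path: touches only the target queue's lock, so traffic
     * on other queues is completely unaffected. */
    void queue_push(queue *q, msg *m) {
        pthread_mutex_lock(&q->mu);
        m->next = NULL;
        if (q->tail) q->tail->next = m;
        else         q->head = m;
        q->tail = m;
        pthread_mutex_unlock(&q->mu);
    }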
If you read several implementations and papers regarding the subject, you'll notice there is the following common theme:
1) Shared state objects are Lisp/Clojure-style immutable: that is, all write operations are implemented by copying the existing state into a new object, making modifications to the new object and then trying to update the shared state (obtained from an aligned pointer that can be updated with the CAS primitive). In other words, you NEVER EVER modify an existing object that might be read by more than the current thread. Immutability can be optimized using copy-on-write semantics for big, complex objects, but that's another tree of nuts.
2) You clearly specify which transitions between the current and the next state are valid: then validating that the algorithm is correct becomes orders of magnitude easier.
3) Handle discarded references in per-thread hazard pointer lists. Once the referenced objects are safe, reuse them if possible.
See another related post of mine where some code implemented with semaphores and mutexes is (partially) reimplemented in a lock-free style:
Mutual exclusion and semaphores
