I have a problem which is essentially a series of searches for multiple copies of items (needles) in a massive but in-memory database (tens of GB): the haystack.
This is divided into tasks where each task is to find each of a series of needles in the haystack
and each task is logically independent from the other tasks.
(This is already distributed across multiple machines where each machine
has its own copy of the haystack.)
There are many ways this could be parallelized on individual machines.
We could have one search process per CPU core sharing memory.
Or we could have one search process with multiple threads (one per core). Or even several multi-threaded processes.
Three possible architectures:
1) A process loads the haystack into POSIX shared memory. Subsequent processes use the shared memory segment instead (like a cache).
2) A process loads the haystack into memory and then forks. Each process uses the same memory because of copy-on-write semantics.
3) A process loads the haystack into memory and spawns multiple search threads.
The question is: is one method likely to be better, and why? Or rather, what are the trade-offs?
(For argument's sake assume performance trumps implementation complexity).
Implementing two or three and measuring is possible of course but hard work.
Are there any reasons why one might be definitively better?
Data in the haystack is immutable.
The processes are running on Linux. So processes are not significantly more expensive than threads.
The haystack spans many GBs so CPU caches are not likely to help.
The search process is essentially a binary search (actually equal_range with a touch of interpolation).
Because the tasks are logically independent there is no benefit from inter-thread communication being
cheaper than inter-process communication (as per for example https://stackoverflow.com/a/18114475/1569204).
I cannot think of any obvious performance trade-offs between threads and shared memory here. Are there any? Perhaps the code maintenance trade-offs are more relevant?
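For concreteness, the threaded variant (architecture 3) could look roughly like the sketch below. It is only an illustration under assumptions that are not in the question: the haystack is modelled as a sorted `std::vector<std::uint64_t>`, and `count_copies` is a hypothetical name. Because the haystack is immutable, the threads read it without any synchronisation, and each thread writes only to its own contiguous range of result slots.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: for each needle, count its copies in the sorted haystack.
// The haystack is only read, so the threads need no locking to share it.
std::vector<std::size_t> count_copies(const std::vector<std::uint64_t>& haystack,
                                      const std::vector<std::uint64_t>& needles,
                                      unsigned num_threads) {
    // Debug-only precondition check (linear time, compiled out with NDEBUG).
    assert(std::is_sorted(haystack.begin(), haystack.end()));
    if (num_threads == 0) num_threads = 1;

    const std::size_t n = needles.size();
    std::vector<std::size_t> counts(n, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        // Each thread gets one contiguous chunk of needles and result slots.
        const std::size_t begin = n * t / num_threads;
        const std::size_t end = n * (t + 1) / num_threads;
        workers.emplace_back([&haystack, &needles, &counts, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                auto range = std::equal_range(haystack.begin(), haystack.end(), needles[i]);
                counts[i] = static_cast<std::size_t>(range.second - range.first);
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return counts;
}
```

The process-based variants would run the same inner loop; only the way the haystack pointer is obtained (shared memory segment or inherited after fork) would differ.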
Background research
The only relevant SO answer I could find refers to the overhead of synchronising threads - Linux: Processes and Threads in a Multi-core CPU - which is true but less applicable here.
Related and interesting but different questions are:
Multithreading: What is the point of more threads than cores?
Performance difference between IPC shared memory and threads memory
performance - multithreaded or multiprocess applications
An interesting presentation is https://elinux.org/images/1/1c/Ben-Yossef-GoodBadUgly.pdf
It suggests there can be a small difference in the speed of thread vs process context switches.
I am assuming that, except for a monitoring thread/process, the others are never switched out.
General advice: Be able to measure improvements! Without that, you may tweak all you like based on advice off the internet but still not get optimal performance. Effectively, I'm telling you not to trust me or anyone else (including yourself) but to measure. Also prepare yourself for measuring this in real time on production systems. A benchmark may help you to some extent, but real load patterns are still a different beast.
Then, you say the operations are purely in-memory, so the speed doesn't depend on (network or storage) IO performance. The two bottlenecks you face are CPU and RAM bandwidth. So, in order to work on the right part, find out which is the limiting factor. Making sure that the corresponding part is efficient ensures optimal performance for your searches.
Further, you say that you do binary searches. This basically means you do O(log n) comparisons, where each comparison requires a load of a certain element from the haystack. This load probably goes through all caches, because the size of the data makes cache hits very unlikely. However, you could hold multiple needles to search for in cache at the same time. If you then manage to trigger the memory loads for several needles first and only then perform the comparisons, you could reduce the time where either CPU or RAM is idle because it waits for new operations to perform. The number of needles to batch is obviously (like other things) a parameter you need to tweak for the system it runs on.
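One way to read the batching idea is the sketch below: several binary searches advance in lock step, and each round first issues a prefetch for every pending probe before doing any comparison. The function name `batched_lower_bound` and the `std::uint64_t` element type are assumptions for illustration, and `__builtin_prefetch` is a GCC/Clang builtin, so it is guarded by a preprocessor check. Whether this helps at all depends on the hardware and has to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns, for every needle, the index that
// std::lower_bound would return, advancing all searches one probe per round.
std::vector<std::size_t> batched_lower_bound(const std::vector<std::uint64_t>& haystack,
                                             const std::vector<std::uint64_t>& needles) {
    // Debug-only precondition check (linear time, compiled out with NDEBUG).
    assert(std::is_sorted(haystack.begin(), haystack.end()));
    const std::size_t k = needles.size();
    std::vector<std::size_t> lo(k, 0);                 // start of each remaining range
    std::vector<std::size_t> len(k, haystack.size());  // length of each remaining range
    bool any_active = k > 0 && !haystack.empty();
    while (any_active) {
        // Pass 1: request the next probe element of every unfinished search.
        for (std::size_t i = 0; i < k; ++i) {
            if (len[i] > 0) {
#if defined(__GNUC__) || defined(__clang__)
                __builtin_prefetch(&haystack[lo[i] + len[i] / 2]);
#endif
            }
        }
        // Pass 2: do the comparisons; by now the loads are hopefully in flight.
        any_active = false;
        for (std::size_t i = 0; i < k; ++i) {
            if (len[i] == 0) continue;
            const std::size_t half = len[i] / 2;
            const std::size_t mid = lo[i] + half;
            if (haystack[mid] < needles[i]) {
                lo[i] = mid + 1;
                len[i] -= half + 1;
            } else {
                len[i] = half;
            }
            if (len[i] > 0) any_active = true;
        }
    }
    return lo;
}
```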
Even further, reconsider binary searching. Binary searching performs reliably with a good upper bound on random data. If you have any patterns (i.e. anything non-random) in your data, try to exploit this knowledge. If you can roughly estimate the location of the needle you're searching for, you may thus reduce the number of lookups. This is basically moving the work from the RAM bus to the CPU, so it again depends which is the actual bottleneck. Note that you can also switch algorithms, e.g. going from an educated guess to a binary search when you have less than a certain amount of elements left to consider.
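A minimal sketch of such a hybrid, assuming 64-bit integer keys: a few interpolation guesses narrow the range, then `std::lower_bound` finishes once the range is small or the guess budget is used up. The name `interpolated_lower_bound` and the parameters `cutoff` and `max_guesses` are made up for this example; their default values are arbitrary and would need tuning.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: same result as std::lower_bound, but the first few
// probes are placed by linear interpolation between the range endpoints.
std::size_t interpolated_lower_bound(const std::vector<std::uint64_t>& haystack,
                                     std::uint64_t needle,
                                     std::size_t cutoff = 64,
                                     int max_guesses = 8) {
    // Debug-only precondition check (linear time, compiled out with NDEBUG).
    assert(std::is_sorted(haystack.begin(), haystack.end()));
    // Invariant: elements before lo are < needle, elements from hi on are >= needle.
    std::size_t lo = 0;
    std::size_t hi = haystack.size();
    for (int g = 0; g < max_guesses && hi - lo > cutoff; ++g) {
        const std::uint64_t first = haystack[lo];
        const std::uint64_t last = haystack[hi - 1];
        if (needle <= first) return lo;
        if (needle > last) return hi;
        // Here first < needle <= last, so the divisor is non-zero.
        const double fraction = static_cast<double>(needle - first) /
                                static_cast<double>(last - first);
        std::size_t guess =
            lo + static_cast<std::size_t>(fraction * static_cast<double>(hi - 1 - lo));
        guess = std::min(guess, hi - 1);
        if (haystack[guess] < needle) {
            lo = guess + 1;
        } else {
            hi = guess;
        }
    }
    // Fall back to plain binary search on whatever range is left.
    const auto it = std::lower_bound(haystack.begin() + lo, haystack.begin() + hi, needle);
    return static_cast<std::size_t>(it - haystack.begin());
}
```

The guess budget keeps badly skewed data from degrading the search to a linear scan; after `max_guesses` probes the remaining range is handled with the usual logarithmic bound.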
Lastly, you say that every node has a full copy of your database. If each of the N nodes is assigned one Nth of the database, it could improve caching. You'd then make one first step at locating the element to determine the node and then dispatch the search to the responsible node. If in doubt, every node can still process the search as a fallback.
The modern approach is to use threads and a single process.
Whether that is better than using multiple processes and a shared memory segment might depend somewhat on your personal preference and how easy threads are to use in the language you are using, but I would say that if decent thread support is available (e.g. Java) you are pretty much always better off using it.
The main advantage of using multiple processes as far as I can see is that it is impossible to run into the kind of issues you can get when managing multiple threads (e.g., forgetting to synchronise access to shared writable resources - except for the shared memory pool). However, thread-safety by not having threads at all is not much of an argument in favour.
It might also be slightly easier to add processes than add threads. You would have to write some code to change the number of processing threads online (or use a framework or application server).
But overall, the multiple-process approach is dead. I haven't used shared memory in decades. Threads have won the day and it is worth the investment to learn to use them.
If you do need to have multi-threaded access to common writable memory then languages like Java give you all sorts of classes for doing that (as well as language primitives). At some point you are going to find you want that and then with the multi-process approach you are faced with synchronising using semaphores and writing your own classes or maybe looking for a third party library, but the Java people will be miles ahead by then.
You also mentioned forking and relying on copy-on-write. This seems like a very fragile solution dependent on particular behaviour of the system and I would not myself use it.
Let's suppose I have a big memory buffer used as a framebuffer, which is constantly written by a thread (or even multiple threads, with a guarantee that no two threads write the same byte concurrently). These writes are nondeterministic in time, scattered through the codebase, and cannot be blocked.
I have another single thread which periodically reads out (copies) the whole buffer for generating a display frame. This read should not be blocked either. Tearing is not a problem in my case. In other words, my only goal is that every change made by the write thread(s) should eventually appear in the reading thread. The ordering or some delay (negligible compared to a display refresh rate) does not matter.
Reading and writing the same memory location concurrently is a data race, which results in undefined behavior in C++11, and this article lists some really dreadful examples where the optimizing compiler generates code for a memory read that alters the memory contents in the presence of a data race.
Still, I need some solution without completely redesigning this legacy code. Any advice that is safe from a practical standpoint counts, whether or not it is theoretically correct. I am also open to not-fully-portable solutions.
Apart from the data race itself, I can easily force the visibility of the buffer changes in the reading thread by establishing a synchronizes-with relation between the threads (acquire-release on an atomic guard variable, used for nothing else), or by adding platform-specific memory fence calls at key points in the writer thread(s).
My ideas to target the data race:
Use assembly for the reading thread. I would try to avoid that.
Make the memory buffer volatile, thus preventing the compiler from performing the nasty optimizations described in the referenced article.
Put the reading thread's code in a separate compile unit, and compile with -O0
+1. Leave everything as is, and cross my fingers (as currently I do not notice issues) :)
What is the safest from the list above? Do you see a better solution?
FYI, the target platform is ARM (with multiple cores) and x86 (for testing).
(This question makes a previous one, which was a little too generic, more concrete.)
For a project I am thinking about using pthread reader-writer locks or fcntl()-based file locks. I have to choose one of them. Could you please explain the differences between them? What are the advantages and disadvantages?
They're two completely different tools and are generally used for different tasks. A fair or complete contrast between the two is difficult as it's like comparing apples to couches.
TL;DR:
fcntl(2):
Advisory locking interface
Primarily works on files
Easily works between multiple processes
pthread_rwlock:
Advisory locking interface
Serializes access to conflicting operations (writes/reads, writes/writes, reads/writes)
Provides safety for shared resources (memory, file descriptors, etc) between multiple threads in a single process
fcntl(2)-based locks implement POSIX advisory locks on files, or ranges of bytes within the files. As they are advisory, nothing enforces these locks -- all processes (or threads) must cooperate and respect the semantics of the locks for them to be effective. For example, consider two processes A and B operating on some file f. If process A sets locks on f, B can completely ignore these locks and do whatever it likes. In general, this interface is used to protect access to entire files (or ranges within a file) between multiple threads and / or processes.
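As a rough illustration of the interface, the sketch below takes and releases an exclusive advisory lock on a whole file with `fcntl(2)`. The helper names `lock_whole_file` and `unlock_whole_file` are made up for this example. Nothing stops another process from ignoring the lock and writing to the file anyway; the lock only helps if every participant calls the same functions.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: block until an exclusive (write) lock on the whole
// file is granted. Returns 0 on success, -1 on error (errno is set).
int lock_whole_file(int fd) {
    assert(fd >= 0);
    struct flock fl{};
    fl.l_type = F_WRLCK;    // exclusive lock; F_RDLCK would request a shared one
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;           // a length of 0 means "through the end of the file"
    return fcntl(fd, F_SETLKW, &fl);  // F_SETLKW waits; F_SETLK would fail instead
}

// Hypothetical helper: release any lock this process holds on the file.
int unlock_whole_file(int fd) {
    assert(fd >= 0);
    struct flock fl{};
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, F_SETLK, &fl);
}
```

Locking a byte range instead of the whole file would only change `l_start` and `l_len`.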
The pthread_rwlock interface could also be considered an advisory locking system (all threads must use the API for the locking to be effective). However, it is not implemented on top of files and is not limited in scope to protecting access to files. A reader-writer lock is a form of shared memory mutual exclusion interface such that multiple readers may concurrently execute a critical section while writers are blocked, or such that individual writers may execute a critical section, blocking all other concurrent readers and writers. In general, this interface is used to safeguard access to shared mutable state (possibly shared memory, possibly file access) between multiple threads in a process in read-mostly workloads. This API is not typically used to protect concurrent access in multiple processes.
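For comparison, a minimal sketch of the `pthread_rwlock` side: a small table shared between threads of one process, where lookups take the lock in read mode and updates take it in write mode. The class name `SharedTable` and its methods are invented for this example, and error handling is reduced to a debug assertion.

```cpp
#include <cassert>
#include <map>
#include <pthread.h>
#include <string>

// Hypothetical read-mostly table guarded by a reader-writer lock.
class SharedTable {
public:
    SharedTable() {
        const int rc = pthread_rwlock_init(&lock_, nullptr);
        assert(rc == 0);
        (void)rc;
    }
    ~SharedTable() { pthread_rwlock_destroy(&lock_); }
    SharedTable(const SharedTable&) = delete;
    SharedTable& operator=(const SharedTable&) = delete;

    // Writers get exclusive access: no readers or other writers run meanwhile.
    void put(const std::string& key, int value) {
        pthread_rwlock_wrlock(&lock_);
        table_[key] = value;
        pthread_rwlock_unlock(&lock_);
    }

    // Readers share the lock, so many lookups can proceed concurrently.
    bool get(const std::string& key, int& out) const {
        pthread_rwlock_rdlock(&lock_);
        const auto it = table_.find(key);
        const bool found = it != table_.end();
        if (found) out = it->second;
        pthread_rwlock_unlock(&lock_);
        return found;
    }

private:
    mutable pthread_rwlock_t lock_;
    std::map<std::string, int> table_;
};
```

As with the file locks, this is cooperative: code that touches `table_` without going through `put` or `get` is not protected.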
If I were faced with the decision on picking one of these interfaces for serializing access to some data, I'd expect to ask myself at least a couple of questions:
Am I primarily working on files?
Do I have multiple processes?
If the intent is largely to protect access to a file, but only ever in a single process, I might settle for using pthread_rwlock. The downside to this approach is that if I ever needed to use multiple processes to access the file in the future, I wouldn't have a good way to express my locking intent to those other processes. Similarly, if I'm primarily trying to serialize access to some shared memory, I would use pthread_rwlock because the fcntl(2) interface expresses some intent on a file.
When trying to cooperate between multiple processes reading and writing a single file, I might use fcntl(2). However, this is likely to become very complicated, very quickly. For example, how do I handle events like truncation? Consider a case where process A has read 1024 bytes into a file that is then truncated to 0 bytes by process B. A must then seek to the beginning of the file and wait for new data to be written to continue reading without errors -- or to correctly append new data itself!
Solving these issues requires more locked communication in additional files, and the complexity can quickly spiral out of control. If I was required to implement some sort of concurrent system working on a file, I'd likely choose multiple threads and use the pthread_rwlock API. It's just easier to manage the totality of updates required to implement such a system in a single process. Without knowing the requirements you're faced with, it's rather difficult to guide one way or another.
When performing many disk operations, does multithreading help, hinder, or make no difference?
For example, when copying many files from one folder to another.
Clarification: I understand that when other operations are performed, concurrency will obviously make a difference. If the task was to open an image file, convert to another format, and then save, disk operations can be performed concurrently with the image manipulation. My question is when the only operations performed are disk operations, whether concurrently queuing and responding to disk operations is better.
Most of the answers so far have had to do with the OS scheduler. However, there is a more important factor that I think would lead to your answer. Are you writing to a single physical disk, or multiple physical disks?
Even if you parallelize with multiple threads...IO to a single physical disk is intrinsically a serialized operation. Each thread would have to block, waiting for its chance to get access to the disk. In this case, multiple threads are probably useless...and may even lead to contention problems.
However, if you are writing multiple streams to multiple physical disks, processing them concurrently should give you a boost in performance. This is particularly true with managed disks, like RAID arrays, SAN devices, etc.
I don't think the issue has as much to do with the OS scheduler as it has to do with the physical aspects of the disk(s) you're writing to.
That depends on your definition of "I/O bound" but generally multithreading has two effects:
Use multiple CPUs concurrently (which won't necessarily help if the bottleneck is the disk rather than the CPU[s])
Use a CPU (with another thread) even while one thread is blocked (e.g. waiting for I/O completion)
I'm not sure that Konrad's answer is always right, however: as a counter-example, if "I/O bound" just means "one thread spends most of its time waiting for I/O completion instead of using the CPU", but does not mean that "we've hit the system I/O bandwidth limit", then IMO having multiple threads (or asynchronous I/O) might improve performance (by enabling more than one concurrent I/O operation).
I would think it depends on a number of factors, like the kind of application you are running, the number of concurrent users, etc.
I am currently working on a project that has a high degree of linear (reading files from start to finish) operations. We use a NAS for storage, and were concerned about what happens if we run multiple threads. Our initial thought was that it would slow us down because it would increase head seeks. So we ran some tests and found out that the ideal number of threads is the same as the number of cores in the computer.
But your mileage may vary.
It can do, simply because whenever there is more work for a thread to do (identifying the next file to copy) the OS wakes it up, so threads are a simple way to hook into the OS scheduler and yet still write code in a traditional sequential way, instead of having to break it up into a state machine with callbacks.
This is mainly an assistance with clear programming rather than performance.
In most cases, using multi-thread for disk IO will not benefit efficiency. Let's imagine 2 circumstances:
Lock-Free File: We can split the file between the threads by giving each a different IO offset. For instance, a 1024-byte file is split into n pieces and each thread writes its own 1024/n bytes. This causes a lot of extra disk head movement because of the different offsets.
Lock File: Actually lock the IO operation for each critical section. This causes a lot of extra thread switches, and it turns out that only one thread can write to the file at a time.
Correct me if I'm wrong.
No, it makes no sense. At some point, the operations have to be serialized (by the OS). On the other hand, since modern OS's have to cope with multiple processes anyway I doubt that there's an added overhead.
I'd think it would hinder the operations... You only have one controller and one drive.
You could use a second thread to do the operation, and a main thread that shows an updated UI.
I think it could worsen the performance, because the multiple threads will compete for the same resources.
You can test the impact of doing concurrent IO operations on the same device by copying a set of files from one place to another and measuring the time, then splitting the set in two parts and making the copies in parallel... the second option will be noticeably slower.
In my multithreaded application I see heavy lock contention, preventing good scalability across multiple cores. I have decided to use lock-free programming to solve this.
How can I write a lock free structure?
Short answer is:
You cannot.
Long answer is:
If you are asking this question, you probably do not know enough to be able to create a lock-free structure. Creating lock-free structures is extremely hard, and only experts in this field can do it. Instead of writing your own, search for an existing implementation. When you find one, check how widely it is used, how well it is documented, whether it is well proven, and what its limitations are; even some lock-free structures that other people have published are broken.
If you do not find a lock free structure corresponding to the structure you are currently using, rather adapt the algorithm so that you can use some existing one.
If you still insist on creating your own lock free structure, be sure to:
start with something very simple
understand memory model of your target platform (including read/write reordering constraints, what operations are atomic)
study a lot about problems other people encountered when implementing lock free structures
do not just guess if it will work, prove it
heavily test the result
More reading:
Lock free and wait free algorithms at Wikipedia
Herb Sutter: Lock-Free Code: A False Sense of Security
Use a library such as Intel's Threading Building Blocks; it contains quite a few lock-free structures and algorithms. I really wouldn't recommend attempting to write lock-free code yourself; it's extremely error-prone and hard to get right.
Writing thread-safe lock free code is hard; but this article from Herb Sutter will get you started.
As sblundy pointed out, if all objects are immutable, read-only, you don't need to worry about locking, however, this means you may have to copy objects a lot. Copying usually involves malloc and malloc uses locking to synchronize memory allocations across threads, so immutable objects may buy you less than you think (malloc itself scales rather badly and malloc is slow; if you do a lot of malloc in a performance critical section, don't expect good performance).
When you only need to update simple variables (e.g. 32- or 64-bit integers or pointers), perform simple addition or subtraction operations on them, or just swap the values of two variables, most platforms offer "atomic operations" for that (GCC offers these as well). Atomic is not the same as thread-safe. However, atomic access makes sure that if, for example, one thread writes a 64-bit value to a memory location and another thread reads from it, the reader gets either the value from before the write or the value from after it, but never a torn value in between (e.g. one where the first 32 bits are already new and the last 32 bits are still old; this can happen if you don't use atomic access on such a variable).
However, if you have a C struct with 3 values that you want to update, then even if you update all three with atomic operations, these are three independent operations, so a reader might see the struct with one value already updated and two not yet updated. Here you will need a lock if you must ensure that the reader sees either all old or all new values in the struct.
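A small sketch of the simple-variable case, using C++11 `std::atomic` rather than compiler-specific builtins (that choice, and the function name `count_with_atomic`, are assumptions of this example). Several threads increment one shared 64-bit counter with `fetch_add`, and no increment is lost even though no lock is taken.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical demo: every thread adds to the same counter without a lock.
std::uint64_t count_with_atomic(unsigned threads, unsigned increments_per_thread) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (unsigned i = 0; i < increments_per_thread; ++i) {
                // Atomic read-modify-write: concurrent increments cannot be lost.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (std::thread& w : workers) w.join();
    const std::uint64_t total = counter.load();
    // Holds because each fetch_add is indivisible.
    assert(total == static_cast<std::uint64_t>(threads) * increments_per_thread);
    return total;
}
```

With a plain `std::uint64_t` and `++counter` the same program would contain a data race, and the final count could come out lower.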
One way to make locks scale a lot better is using R/W locks. In many cases, updates to data are rather infrequent (write operations), but accessing the data is very frequent (reading the data), think of collections (hashtables, trees). In that case R/W locks will buy you a huge performance gain, as many threads can hold a read-lock at the same time (they won't block each other) and only if one thread wants a write lock, all other threads are blocked for the time the update is performed.
The best way to avoid thread-issues is to not share any data across threads. If every thread deals most of the time with data no other thread has access to, you won't need locking for that data at all (also no atomic operations). So try to share as little data as possible between threads. Then you only need a fast way to move data between threads if you really have to (ITC, Inter Thread Communication). Depending on your operating system, platform and programming language (unfortunately you told us neither of these), various powerful methods for ITC might exist.
And finally, another trick to work with shared data but without any locking is to make sure threads don't access the same parts of the shared data. E.g. if two threads share an array, but one will only ever access even indexes and the other only odd ones, you need no locking. Or if both share the same memory block and one only uses the upper half of it and the other only the lower half, you need no locking. That is not to say that this will lead to good performance, especially not on multi-core CPUs. Write operations of one thread to this shared data (running on one core) might invalidate cached data for another thread (running on another core) when both touch the same cache line, and this cache traffic is often the bottleneck for multithreaded applications running on modern multi-core CPUs.
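A common way to keep the "each thread has its own part" trick from thrashing the caches is to pad each thread's slot so that two slots never share a cache line. The sketch below assumes 64-byte cache lines, which is typical but not universal; `PaddedCounter` and `sum_with_padded_slots` are names invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines. alignas pads the struct to 64 bytes, so
// consecutive slots in the vector are 64 bytes apart.
struct alignas(64) PaddedCounter {
    std::uint64_t value = 0;
};

// Hypothetical demo: each thread updates only its own slot, without locks
// or atomics, and the slots are combined after all threads have finished.
std::uint64_t sum_with_padded_slots(unsigned threads, std::uint64_t per_thread) {
    std::vector<PaddedCounter> slots(threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&slots, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                slots[t].value += 1;  // private to this thread
            }
        });
    }
    for (std::thread& w : workers) w.join();  // join makes the writes visible here
    std::uint64_t total = 0;
    for (const PaddedCounter& s : slots) total += s.value;
    assert(total == threads * per_thread);
    return total;
}
```

Removing the `alignas(64)` keeps the program correct but packs eight counters into one line, which is the situation the paragraph above warns about.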
As my professor (Nir Shavit from "The Art of Multiprocessor Programming") told the class: Please don't. The main reason is testability - you can't test synchronization code. You can run simulations, you can even stress test. But it's a rough approximation at best. What you really need is a mathematical correctness proof. And very few are capable of understanding them, let alone writing them.
So, as others had said: use existing libraries. Joe Duffy's blog surveys some techniques (section 28). The first one you should try is tree-splitting - break to smaller tasks and combine.
Immutability is one approach to avoid locking. See Eric Lippert's discussion and implementation of things like immutable stacks and queues.
in re. Suma's answer, Maurice Herlihy shows in The Art of Multiprocessor Programming that actually anything can be written without locks (see chapter 6). iirc, this essentially involves splitting tasks into processing node elements (like a function closure), and enqueuing each one. Threads will calculate the state by following all nodes from the latest cached one. Obviously this could, in the worst case, result in sequential performance, but it does have important lockless properties, preventing scenarios where threads could get scheduled out for long periods of time when they are holding locks. Herlihy also achieves theoretical wait-free performance, meaning that one thread will not end up waiting forever to win the atomic enqueue (this is a lot of complicated code).
A multi-threaded queue / stack is surprisingly hard (check the ABA problem). Other things may be very simple. Become accustomed to while(true) { atomicCAS until I swapped it } blocks; they are incredibly powerful. An intuition for what's correct with CAS can help development, though you should use good testing and maybe more powerful tools (maybe SKETCH, upcoming MIT Kendo, or spin?) to check correctness if you can reduce it to a simple structure.
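One of the simple cases is worth seeing once. The sketch below uses the retry-until-swapped pattern to keep a shared "highest value seen so far" without a lock. It uses `std::atomic<int>::compare_exchange_weak` from C++11; the function name `atomic_store_max` is made up for this example.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: raise `target` to `candidate` if candidate is larger.
void atomic_store_max(std::atomic<int>& target, int candidate) {
    int seen = target.load(std::memory_order_relaxed);
    // Retry until either someone else has stored a value >= candidate, or our
    // swap succeeds. On failure compare_exchange_weak reloads `seen` with the
    // current value, so the loop condition is re-evaluated against fresh data.
    while (seen < candidate && !target.compare_exchange_weak(seen, candidate)) {
    }
    // Postcondition, valid as long as all writers go through this function.
    assert(target.load() >= candidate);
}
```

There is no ABA hazard here because only the value itself matters; the hard cases start when the swapped word is a pointer to memory that can be freed and reused.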
Please post more about your problem. It's difficult to give a good answer without details.
edit: immutability is nice but its applicability is limited, if I'm understanding it right. It doesn't really overcome write-after-read hazards; consider two threads executing "mem = NewNode(mem)"; they could both read mem, then both write it, which is not correct for a classic increment function. Also, it's probably slow due to heap allocation (which has to be synchronized across threads).
Immutability would have this effect. Changes to the object result in a new object. Lisp works this way under the covers.
Item 13 of Effective Java explains this technique.
Cliff Click has done some major research on lock-free data structures by utilizing finite state machines and has also posted a lot of implementations for Java. You can find his papers, slides and implementations at his blog: http://blogs.azulsystems.com/cliff/
Use an existing implementation, as this area of work is the realm of domain experts and PhDs (if you want it done right!)
For example there is a library of code here:
http://www.cl.cam.ac.uk/research/srg/netos/lock-free/
Most lock-free algorithms or structures start with some atomic operation, i.e. a change to some memory location that once begun by a thread will be completed before any other thread can perform that same operation. Do you have such an operation in your environment?
See here for the canonical paper on this subject.
Also try this Wikipedia article for further ideas and links.
The basic principle for lock-free synchronisation is this:
whenever you are reading the structure, you follow the read with a test to see if the structure was mutated since you started the read, and retry until you succeed in reading without something else coming along and mutating while you are doing so;
whenever you are mutating the structure, you arrange your algorithm and data so that there is a single atomic step which, if taken, causes the entire change to become visible to the other threads, and arrange things so that none of the change is visible unless that step is taken. You use whatever lockfree atomic mechanism exists on your platform for that step (e.g. compare-and-set, load-linked+store-conditional, etc.). In that step you must then check to see if any other thread has mutated the object since the mutation operation began, commit if it has not and start over if it has.
There are plenty of examples of lock-free structures on the web; without knowing more about what you are implementing and on what platform it is hard to be more specific.
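The mutation rule above can be sketched with a push-only stack: the new node is prepared privately, and a single compare-and-swap on the head pointer makes it visible, retrying if another thread changed the head in the meantime. This is a deliberately restricted illustration, not a general-purpose container: it has no pop, so nodes are never freed while other threads may read them, which sidesteps the ABA and memory reclamation problems. `PushOnlyStack` and `push_from_threads` are names invented here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

template <typename T>
class PushOnlyStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head_{nullptr};

public:
    PushOnlyStack() = default;
    PushOnlyStack(const PushOnlyStack&) = delete;
    PushOnlyStack& operator=(const PushOnlyStack&) = delete;

    // Only safe once no other thread uses the stack any more.
    ~PushOnlyStack() {
        Node* n = head_.load();
        while (n != nullptr) {
            Node* next = n->next;
            delete n;
            n = next;
        }
    }

    void push(T value) {
        // Build the node privately; nothing is visible to other threads yet.
        Node* node = new Node{std::move(value), head_.load(std::memory_order_relaxed)};
        // The single atomic step: swing head from node->next to node. If the
        // head changed meanwhile, node->next is refreshed and we try again.
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    template <typename Fn>
    void for_each(Fn fn) const {
        for (const Node* n = head_.load(std::memory_order_acquire); n != nullptr; n = n->next) {
            fn(n->value);
        }
    }
};

// Hypothetical demo: several threads push concurrently, then the nodes are counted.
std::size_t push_from_threads(unsigned threads, unsigned per_thread) {
    PushOnlyStack<unsigned> stack;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&stack, per_thread] {
            for (unsigned i = 0; i < per_thread; ++i) stack.push(i);
        });
    }
    for (std::thread& w : workers) w.join();
    std::size_t count = 0;
    stack.for_each([&count](unsigned) { ++count; });
    assert(count == static_cast<std::size_t>(threads) * per_thread);
    return count;
}
```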
If you are writing your own lock-free data structures for a multi-core CPU, do not forget about memory barriers! Also, consider looking into Software Transactional Memory techniques.
Well, it depends on the kind of structure, but you have to make the structure so that it carefully and silently detects and handles possible conflicts.
I doubt you can make one that is 100% lock-free, but again, it depends on what kind of structure you need to build.
You might also need to shard the structure so that multiple threads work on individual items, and then later on synchronize/recombine.
As mentioned, it really depends on what type of structure you're talking about. For instance, you can write a limited lock-free queue, but not one that allows random access.
Reduce or eliminate shared mutable state.
In Java, utilize the java.util.concurrent packages in JDK 5+ instead of writing your own. As was mentioned above, this is really a field for experts, and unless you have a spare year or two, rolling your own isn't an option.
Can you clarify what you mean by structure?
Right now, I am assuming you mean the overall architecture. You can accomplish it by not sharing memory between processes, and by using an actor model for your processes.
Take a look at my link ConcurrentLinkedHashMap for an example of how to write a lock-free data structure. It is not based on any academic papers and doesn't require years of research as others imply. It simply takes careful engineering.
My implementation does use a ConcurrentHashMap, which is a lock-per-bucket algorithm, but it does not rely on that implementation detail. It could easily be replaced with Cliff Click's lock-free implementation. One idea I borrowed from Cliff, but used much more explicitly, is to model all CAS operations with a state machine. This greatly simplifies the model, as you'll see that I have pseudo locks via the 'ing states. Another trick is to allow laziness and resolve as needed. You'll see this often with backtracking or letting other threads "help" to clean up. In my case, I decided to allow dead nodes on the list to be evicted when they reach the head, rather than deal with the complexity of removing them from the middle of the list. I may change that, but I didn't entirely trust my backtracking algorithm and wanted to put off a major change like adopting a 3-node locking approach.
The book "The Art of Multiprocessor Programming" is a great primer. Overall, though, I'd recommend avoiding lock-free designs in the application code. Often times it is simply overkill where other, less error prone, techniques are more suitable.
If you see lock contention, I would first try to use more granular locks on your data structures rather than completely lock-free algorithms.
For example, I currently work on a multithreaded application that has a custom messaging system (a list of queues, one per thread; each queue contains messages for its thread to process) to pass information between threads. There is a global lock on this structure. In my case, I don't need speed so much, so it doesn't really matter. But if this lock became a problem, it could be replaced by individual locks on each queue, for example. Then adding/removing an element to/from a specific queue wouldn't affect other queues. There would still be a global lock for adding a new queue and such, but it wouldn't be so heavily contended.
Even a single multi-producer/consumer queue can be written with granular locking on each element, instead of having a global lock. This may also eliminate contention.
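A minimal sketch of the per-queue locking idea, under a simplifying assumption that is not in the description above: the number of queues is fixed at construction, so no global lock is needed for adding queues. The class name `Mailboxes` and its methods are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical message hub: one queue per receiving thread, each with its own
// mutex, so posting to queue 0 never contends with posting to queue 1.
class Mailboxes {
    struct Box {
        std::mutex lock;
        std::deque<std::string> messages;
    };
    std::vector<Box> boxes_;  // size fixed at construction (assumption of this sketch)

public:
    explicit Mailboxes(std::size_t count) : boxes_(count) {}

    void post(std::size_t to, std::string message) {
        assert(to < boxes_.size());
        Box& box = boxes_[to];
        std::lock_guard<std::mutex> guard(box.lock);  // locks only this queue
        box.messages.push_back(std::move(message));
    }

    // Returns false if the queue is currently empty.
    bool take(std::size_t from, std::string& out) {
        assert(from < boxes_.size());
        Box& box = boxes_[from];
        std::lock_guard<std::mutex> guard(box.lock);
        if (box.messages.empty()) return false;
        out = std::move(box.messages.front());
        box.messages.pop_front();
        return true;
    }
};
```

Supporting queues that are added at run time would bring back a lock (or a reader-writer lock) around the container itself, but it would be taken far less often than the per-queue locks.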
If you read several implementations and papers regarding the subject, you'll notice there is the following common theme:
1) Shared state objects are Lisp/Clojure-style immutable: that is, all write operations are implemented by copying the existing state into a new object, making modifications to the new object and then trying to update the shared state (obtained from an aligned pointer that can be updated with the CAS primitive). In other words, you NEVER EVER modify an existing object that might be read by more than the current thread. Immutability can be optimized using copy-on-write semantics for big, complex objects, but that's another tree of nuts
2) You clearly specify which transitions between the current and next state are valid: validating that the algorithm is correct then becomes orders of magnitude easier
3) Handle discarded references in hazard pointer lists per thread. After the reference objects are safe, reuse if possible
See another related post of mine where some code implemented with semaphores and mutexes is (partially) reimplemented in a lock-free style:
Mutual exclusion and semaphores