Multithreading in AWS - multithreading

I am a new AWS EC2 user and I am going to deploy a mostly IO-bound application on a Linux-based EC2 m4.large instance. As far as I can tell from the AWS instance types sheet, I have 2 vCPUs, which means I have two hyperthreads running on 1 physical CPU. My question is therefore about multithreading. As I understand it, the maximum number of threads I would be able to use should be 2, but I was wondering whether there are any guidelines about multithreaded computing on AWS instances. Basically, my application reads a big file (1.5+ GB) and then needs to process its chunks. I was thinking of implementing either a producer-consumer pattern (1 thread reading and 1 processing) or a map-like approach (every thread opens the file and seeks to its partition). I know that these two approaches may differ in complexity, but I am interested in performance, so I need to squeeze out as much speed as possible! Thank you in advance.

If your application is IO bound, using multithreaded processing is likely to be of limited utility, since multithreading is primarily useful for optimizing computation rather than IO. However, if you really want to get every last bit of speed, your best bet is to program it both ways and see which works better under your particular circumstances.
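To make the "map-like" variant from the question concrete enough to measure, here is a minimal Python sketch. The helper names (`byte_ranges`, `process_range`, `process_file`) are hypothetical, the "processing" is a stand-in that just sums bytes, and the sketch assumes a chunk can be cut at any byte offset (real record formats usually need boundary handling).

```python
import os
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(file_size, parts):
    """Split [0, file_size) into at most `parts` contiguous (start, end) ranges."""
    if file_size == 0:
        return []
    step = -(-file_size // parts)  # ceiling division
    return [(s, min(s + step, file_size)) for s in range(0, file_size, step)]


def process_range(path, start, end, block_size=1 << 20):
    """Open the file independently, seek to `start`, and process bytes up to `end`.

    Placeholder work: sum the byte values. Replace with the real processing.
    """
    total = 0
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            block = f.read(min(block_size, remaining))
            if not block:
                break
            total += sum(block)
            remaining -= len(block)
    return total


def process_file(path, workers=2):
    """One task per partition; partial results are combined at the end."""
    ranges = byte_ranges(os.path.getsize(path), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda r: process_range(path, *r), ranges))
```

Timing `process_file(path, workers=1)` against `workers=2` on the actual instance is one way to "program it both ways and see". Note that in CPython only the reads overlap across threads; CPU-heavy processing would need processes instead.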

Related

Multithreading vs Shared Memory

I have a problem which is essentially a series of searches for multiple copies of items (needles) in a massive but in-memory database (tens of GB): the haystack.
This is divided into tasks, where each task is to find each of a series of needles in the haystack,
and each task is logically independent of the other tasks.
(This is already distributed across multiple machines where each machine
has its own copy of the haystack.)
There are many ways this could be parallelized on individual machines.
We could have one search process per CPU core sharing memory.
Or we could have one search process with multiple threads (one per core). Or even several multi-threaded processes.
3 possible architectures:
1) A process loads the haystack into POSIX shared memory. Subsequent processes use the shared memory segment instead (like a cache).
2) A process loads the haystack into memory and then forks. Each process uses the same memory because of copy-on-write semantics.
3) A process loads the haystack into memory and spawns multiple search threads.
The question is: is one method likely to be better, and why? Or rather, what are the trade-offs?
(For argument's sake assume performance trumps implementation complexity).
Implementing two or three and measuring is possible of course but hard work.
Are there any reasons why one might be definitively better?
Data in the haystack is immutable.
The processes are running on Linux. So processes are not significantly more expensive than threads.
The haystack spans many GBs so CPU caches are not likely to help.
The search process is essentially a binary search (actually equal_range with a touch of interpolation).
Because the tasks are logically independent there is no benefit from inter-thread communication being
cheaper than inter-process communication (as per for example https://stackoverflow.com/a/18114475/1569204).
I cannot think of any obvious performance trade-offs between threads and shared memory here. Are there any? Perhaps the code maintenance trade-offs are more relevant?
Background research
The only relevant SO answer I could find refers to the overhead of synchronising threads - Linux: Processes and Threads in a Multi-core CPU - which is true but less applicable here.
Related and interesting but different questions are:
Multithreading: What is the point of more threads than cores?
Performance difference between IPC shared memory and threads memory
performance - multithreaded or multiprocess applications
An interesting presentation is https://elinux.org/images/1/1c/Ben-Yossef-GoodBadUgly.pdf
It suggests there can be a small difference in the speed of thread vs process context switches.
I am assuming that, except for a monitoring thread/process, the others are never switched out.
General advice: Be able to measure improvements! Without that, you may tweak all you like based on advice off the internet but still not get optimal performance. Effectively, I'm telling you not to trust me or anyone else (including yourself) but to measure. Also prepare yourself for measuring this in real time on production systems. A benchmark may help you to some extent, but real load patterns are still a different beast.
Then, you say the operations are purely in-memory, so the speed doesn't depend on (network or storage) IO performance. The two bottlenecks you face are CPU and RAM bandwidth. So, in order to work on the right part, find out which is the limiting factor. Making sure that the corresponding part is efficient ensures optimal performance for your searches.
Further, you say that you do binary searches. This basically means you do log(n) comparisons, where each comparison requires a load of a certain element from the haystack. This load probably goes through all caches, because the size of the data makes cache hits very unlikely. However, you could hold multiple needles to search for in cache at the same time. If you then manage to trigger the cache loads for the needles first and then perform the comparison, you could reduce the time where either CPU or RAM are idle because they wait for new operations to perform. This is obviously (like others) a parameter you need to tweak for the system it runs on.
Even further, reconsider binary searching. Binary searching performs reliably with a good upper bound on random data. If you have any patterns (i.e. anything non-random) in your data, try to exploit this knowledge. If you can roughly estimate the location of the needle you're searching for, you may thus reduce the number of lookups. This basically moves work from the RAM bus to the CPU, so it again depends on which is the actual bottleneck. Note that you can also switch algorithms, e.g. going from an educated guess to a binary search when you have fewer than a certain number of elements left to consider.
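The idea of starting with an educated guess and switching to binary search could look roughly like the Python sketch below. It assumes a sorted list of integer keys; the function name `hybrid_search` and the `cutoff` and `max_guesses` parameters are made up for illustration and would need tuning (and measuring) on real data.

```python
from bisect import bisect_left


def hybrid_search(haystack, needle, cutoff=64, max_guesses=8):
    """Return the index of the first element >= needle in a sorted list of ints.

    Narrows the window [lo, hi) with interpolation guesses while it is large,
    then finishes with a plain binary search. `max_guesses` caps the number of
    guesses so badly skewed data cannot degrade the search to a linear scan.
    """
    lo, hi = 0, len(haystack)
    guesses = 0
    while hi - lo > cutoff and guesses < max_guesses:
        first, last = haystack[lo], haystack[hi - 1]
        if needle <= first:
            return lo
        if needle > last:
            return hi
        # Educated guess: assume the values are spread roughly evenly.
        guess = lo + (needle - first) * (hi - 1 - lo) // (last - first)
        if haystack[guess] < needle:
            lo = guess + 1
        else:
            hi = guess
        guesses += 1
    return bisect_left(haystack, needle, lo, hi)
```

On evenly distributed keys the guesses shrink the window much faster than halving does; on skewed keys the cap hands over to `bisect_left` early, so the worst case stays logarithmic.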
Lastly, you say that every node has a full copy of your database. If each of the N nodes is assigned one Nth of the database, it could improve caching. You'd then make one first step at locating the element to determine the node and then dispatch the search to the responsible node. If in doubt, every node can still process the search as a fallback.
The modern approach is to use threads and a single process.
Whether that is better than using multiple processes and a shared memory segment might depend somewhat on your personal preference and how easy threads are to use in the language you are using, but I would say that if decent thread support is available (e.g. Java) you are pretty much always better off using it.
The main advantage of using multiple processes as far as I can see is that it is impossible to run into the kind of issues you can get when managing multiple threads (e.g., forgetting to synchronise access to shared writable resources - except for the shared memory pool). However, thread-safety by not having threads at all is not much of an argument in favour.
It might also be slightly easier to add processes than add threads. You would have to write some code to change the number of processing threads online (or use a framework or application server).
But overall, the multiple-process approach is dead. I haven't used shared memory in decades. Threads have won the day and it is worth the investment to learn to use them.
If you do need to have multi-threaded access to common writable memory then languages like Java give you all sorts of classes for doing that (as well as language primitives). At some point you are going to find you want that and then with the multi-process approach you are faced with synchronising using semaphores and writing your own classes or maybe looking for a third party library, but the Java people will be miles ahead by then.
You also mentioned forking and relying on copy-on-write. This seems like a very fragile solution dependent on particular behaviour of the system and I would not myself use it.

Concurrent processes a lot slower than single process

I am modelling and solving a nonlinear program (NLP) using single-threaded CPLEX with AMPL (I am constraining CPLEX to use only one thread explicitly) in CentOS 7. I am using a processor with 6 independent cores (intel i7 8700) to solve 6 independent test instances.
Running these tests sequentially is much faster (about 63%, in terms of elapsed time) than running the 6 instances concurrently. They are executed in independent processes, reading distinct data files and writing results to distinct output files. I have also tried solving these tests sequentially with multithreading enabled, and I got times similar to the sequential single-threaded runs.
I have checked the behaviour of these processes using top/htop. They get different processors to execute on. So my question is: how can running these tests concurrently have so much impact on elapsed time if they are individual processes, each solving on a different core with only one thread?
Any thoughts would be appreciated.
It's very easy to make many threads perform worse than a single thread. The key to successful multi-threading and speedup is to understand not just the fact that the program is multi-threaded, but to know exactly how your threads interact. Here are a few questions you should ask yourself as you review your code:
1) Do the individual threads share resources? If so what are those resources and when you are accessing them do they block other threads?
2) What's the slowest resource your multi-threaded code relies on? A common bottleneck (and oft neglected) is disk IO. Multiple threads can process data much faster but they won't make a disk read faster and in many cases multithreading can make it much worse (e.g. thrashing).
3) Is access to common resources properly synchronized?
To this end, and without knowing more about your problem, I'd recommend:
a) Not reading different files from different threads. You want to keep your disk IO as sequential as possible and this is easier from a single thread. Maybe batch read files from a single thread and then farm them out for processing.
b) Keep your threads as autonomous as possible - any communication back and forth will cause thread contention and slow things down.
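Recommendations a) and b) can be sketched in Python as a single reader thread that feeds whole files to independent workers through a bounded queue. All names here (`reader`, `worker`, `read_then_process`, `max_pending`) are illustrative, and error handling is left out to keep the sketch short.

```python
import queue
import threading

SENTINEL = None  # stop signal placed on the queue once per worker


def reader(paths, work_queue, n_workers):
    """Single producer: read files one after another so disk access stays sequential."""
    for path in paths:
        with open(path, "rb") as f:
            work_queue.put((path, f.read()))
    for _ in range(n_workers):
        work_queue.put(SENTINEL)


def worker(work_queue, results, process):
    """Consumer: handle one whole file at a time; workers never talk to each other."""
    while True:
        item = work_queue.get()
        if item is SENTINEL:
            break
        path, data = item
        results.put((path, process(data)))


def read_then_process(paths, process, n_workers=2, max_pending=4):
    """Read in the calling thread, process in `n_workers` threads, return {path: result}."""
    work_queue = queue.Queue(maxsize=max_pending)  # bounds memory held by unprocessed files
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(work_queue, results, process))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    reader(paths, work_queue, n_workers)
    for t in threads:
        t.join()
    return dict(results.get() for _ in range(len(paths)))
```

The only shared state is the two queues, which keeps contention between workers low. For CPU-heavy `process` functions in CPython, worker processes would be needed to get real parallelism.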

What is the advantage (if any) of MPI + threads parallelization vs. MPI-only?

Given a cluster of several nodes, each of which hosts multiple-core processor, is there any advantage of using MPI between nodes and OpenMP/pthreads within nodes over using pure all-MPI? If I understand correctly, if I run an MPI-program on a single node and indicate the number of processes equal to the number of cores, then I will have an honest parallel MPI-job of several processes running on separate cores. So why bother about hybrid parallelization using threads within nodes and MPI only between nodes? I have no question in case of MPI+CUDA hybrid, as MPI cannot employ GPUs, but it can employ CPU cores, so why use threads?
Using a combination of OpenMP/pthread threads and MPI processes is known as Hybrid Programming. It is tougher to program than pure MPI but with the recent reduction in latencies with OpenMP, it makes a lot of sense to use Hybrid MPI. Some advantages are:
Avoiding data replication: Since threads can share data within a node, if any data needs to be replicated between processes, we can avoid this.
Light-weight : Threads are lightweight and thus you reduce the meta-data associated with processes.
Reduction in number of messages : A single process within a node can communicate with other processes, reducing number of messages between nodes (and thus reducing pressure on the Network Interface Card). The number of messages involved in collective communication is notable.
Faster communication : As pointed out by @user3528438 above, since threads communicate using shared memory, you can avoid using point-to-point MPI communication within a node. A recent approach (2012) recommends using RMA shared memory instead of threads within a node; this model is called MPI+MPI (search Google Scholar for "MPI plus MPI").
Hybrid MPI has its disadvantages as well, but you asked only about the advantages.
This is in fact a much more complex question than it looks.
It depends on a lot of factors. From experience I would say: you are always happy to avoid hybrid OpenMP-MPI, which is a mess to optimise. But at some point you cannot avoid it, mainly depending on the problem you are solving and the cluster you have access to.
Let's say you are solving a highly parallelizable problem and you have a small cluster; then hybrid will probably be useless.
But suppose you have a problem which scales well up to N processes but starts to have very bad efficiency at 4N, and you have access to a cluster with 10N cores... Then hybridization will be a solution. You will use a small number of threads per MPI process, something like 4 (it is known that >8 is not efficient).
(It's fun to think that on KNL most people I know use 4 to 8 threads per MPI process even though one chip has 68 cores.)
Then what about hybrid accelerator/OpenMP/MPI?
You are wrong about accelerator + MPI. As soon as you start to use a cluster which has accelerators, you will need to use something like OpenMP/MPI, CUDA/MPI or OpenACC/MPI, as you will need to communicate between devices. Nowadays you can bypass the CPU using GPUDirect (at least for Nvidia; no clue about other vendors, but I expect that it would be the case). Then usually you will use 1 MPI process per GPU. Most clusters with GPUs will have 1 socket and N accelerators (N

Programming for Multi core Processors

As far as I know, the multi-core architecture of a processor does not affect the program. The actual instruction execution is handled in a lower layer.
My question is:
Given a multicore environment, can I use any programming practices to utilize the available resources more effectively? How should I change my code to gain more performance in multicore environments?
That is correct. Your program will not run any faster (except for the fact that the core is handling fewer other processes, because some of the processes are being run on the other core) unless you employ concurrency. If you do use concurrency, though, more cores improves the actual parallelism (with fewer cores, the concurrency is interleaved, whereas with more cores, you can get true parallelism between threads).
Making programs efficiently concurrent is no simple task. If done poorly, making your program concurrent can actually make it slower! For example, if you spend lots of time spawning threads (thread construction is really slow), and do work on a very small chunk size (so that the overhead of thread construction dominates the actual work), or if you frequently synchronize your data (which not only forces operations to run serially, but also has a very high overhead on top of it), or if you frequently write to data in the same cache line between multiple threads (which can lead to the entire cache line being invalidated on one of the cores), then you can seriously harm the performance with concurrent programming.
It is also important to note that if you have N cores, that DOES NOT mean that you will get a speedup of N. That is the theoretical limit to the speedup. In fact, maybe with two cores it is twice as fast, but with four cores it might be about three times as fast, and then with eight cores it is about three and a half times as fast, etc. How well your program is actually able to take advantage of these cores is called the parallel scalability. Often communication and synchronization overhead prevent a linear speedup, although, in the ideal, if you can avoid communication and synchronization as much as possible, you can hopefully get close to linear.
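One common model for this sub-linear scaling is Amdahl's law, which the answer does not name but which yields the same kind of curve: if only a fraction of the work can run in parallel, the serial remainder caps the speedup. A small Python sketch follows; the 90% parallel fraction is an arbitrary assumption for illustration, and the model ignores communication and synchronization costs, so real programs usually do worse.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only `parallel_fraction` of the work scales with `cores`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed workload: 90% of the running time is parallelizable.
    for cores in (1, 2, 4, 8, 16):
        print(cores, round(amdahl_speedup(0.9, cores), 2))
```

With these assumptions, going from 4 to 8 cores adds far less than going from 1 to 2, and no number of cores can exceed a 10x speedup.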
It would not be possible to give a complete answer on how to write efficient parallel programs on StackOverflow. This is really the subject of at least one (probably several) computer science courses. I suggest that you sign up for such a course or buy a book. I'd recommend a book to you if I knew of a good one, but the parallel algorithms course I took did not have a textbook. You might also be interested in writing a handful of programs using a serial implementation, a parallel implementation with multithreading (regular threads, thread pools, etc.), and a parallel implementation with message passing (such as with Hadoop, Apache Spark, Cloud Dataflow, asynchronous RPCs, etc.), and then measuring their performance, varying the number of cores in the case of the parallel implementations. This was the bulk of the course work for my parallel algorithms course and can be quite insightful. Some computations you might try parallelizing include computing Pi using the Monte Carlo method (this is trivially parallelizable, assuming you can create a random number generator where the random numbers generated in different threads are independent), performing matrix multiplication, computing the row echelon form of a matrix, summing the squares of the numbers 1...N for some very large N, and I'm sure you can think of others.
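As a starting point for the Monte Carlo exercise, here is a minimal Python sketch. The function names are hypothetical. Giving each task its own seeded generator is a simplification: distinct seeds produce distinct streams, but that is not a formal guarantee of statistical independence.

```python
import random
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_hits(seed, samples):
    """Count random points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)  # private generator: no state shared between tasks
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(total_samples, tasks, executor_cls=ProcessPoolExecutor):
    """Split the samples over `tasks` independent workers and combine the counts."""
    per_task = total_samples // tasks
    with executor_cls(max_workers=tasks) as pool:
        hits = sum(pool.map(count_hits, range(tasks), [per_task] * tasks))
    return 4.0 * hits / (per_task * tasks)
```

Calling `estimate_pi(1_000_000, tasks)` from under an `if __name__ == "__main__":` guard with `tasks` set to 1, 2, 4 and so on, and timing each run, gives the scaling measurement described above. Passing `ThreadPoolExecutor` instead shows how little CPU-bound threads gain in CPython.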
I don't know if it's the best possible place to start, but I subscribed to the article feed from Intel Software Network some time ago and have found a lot of interesting things there, presented in a pretty simple way. You can find some very basic articles there on fundamental concepts of parallel computing, as well as a quick dive into OpenMP, which is one possible approach to start parallelizing the slowest parts of your application without changing the rest (if those parts exhibit parallelism, of course). Also check the Intel Guide for Developing Multithreaded Applications. Or just go and browse the article section; there are not too many articles, so you can quickly figure out what suits you best. They also have a forum and a weekly webcast called Parallel Programming Talk.
Yes, simply adding more cores to a system without altering the software would yield you no results (with the exception that the operating system would be able to schedule multiple concurrent processes on separate cores).
To have your operating system utilise your multiple cores, you need to do one of two things: increase the thread count per process, or increase the number of processes running at the same time (or both!).
Utilising the cores effectively, however, is a beast of a different colour. If you spend too much time synchronising shared data access between threads/processes, your level of concurrency will take a hit as threads wait on each other. This also assumes that you have a problem/computation that can relatively easily be parallelised, since the parallel version of an algorithm is often much more complex than the sequential version thereof.
That said, especially for CPU-bound computations with work units that are independent of each other, you'll most likely see a linear speed-up as you throw more threads at the problem. As you add serial segments and synchronisation blocks, this speed-up will tend to decrease.
I/O heavy computations would typically fare the worst in a multi-threaded environment, since access to the physical storage (especially if it's on the same controller, or the same media) is also serial, in which case threading becomes more useful in the sense that it frees up your other threads to continue with user interaction or CPU-based operations.
You might consider using programming languages designed for concurrent programming. Erlang and Go come to mind.

Multithreading in .NET 4.0 and performance

I've been toying around with the Parallel library in .NET 4.0. Recently, I developed a custom ORM for some unusual read/write operations one of our large systems has to use. This allows me to decorate an object with attributes and have reflection figure out what columns it has to pull from the database, as well as what XML it has to output on writes.
Since I envision this wrapper to be reused in many projects, I'd like to squeeze as much speed out of it as possible. This library will mostly be used in .NET web applications. I'm testing the framework using a throwaway console application to poke at the classes I've created.
I've now learned a lesson of the overhead that multithreading comes with. Multithreading causes it to run slower. From reading around, it seems like it's intuitive to people who've been doing it for a long time, but it's actually counter-intuitive to me: how can running a method 30 times at the same time be slower than running it 30 times sequentially?
I don't think I'm causing problems by multiple threads having to fight over the same shared object (though I'm not good enough at it yet to tell for sure or not), so I assume the slowdown is coming from the overhead of spawning all those threads and the runtime keeping them all straight. So:
Though I'm doing it mainly as a learning exercise, is this pessimization? For trivial, non-IO tasks, is multithreading overkill? My main goal is speed, not responsiveness of the UI or anything.
Would running the same multithreading code in IIS cause it to speed up because of already-created threads in the thread pool, whereas right now I'm using a console app, which I assume would be single-threaded until I told it otherwise? I'm about to run some tests, but I figure there's some base knowledge I'm missing to know why it would be one way or the other. My console app is also running on my desktop with two cores, whereas a server for a web app would have more, so I might have to use that as a variable as well.
Threads don't actually all run concurrently.
On a desktop machine I'm presuming you have a dual-core CPU (maybe a quad at most). This means only 2/4 threads can be running at the same time.
If you have spawned 30 threads, the OS is going to have to context switch between those 30 threads to keep them all running. Context switches are quite costly, hence the slowdown.
As a basic suggestion, I'd aim for 1 thread per CPU if you are trying to optimise calculations. Any more than this and you're not really doing any extra work; you are just swapping threads in and out on the same CPU. Try to think of your computer as having a limited number of workers inside: you can't do more work concurrently than the number of workers you have available.
Some of the new features in the .NET 4.0 parallel task library allow you to do things that account for scalability in the number of threads. For example, you can create a bunch of tasks and the task parallel library will internally figure out how many CPUs you have available and optimise the number of threads it creates/uses so as not to overload the CPUs. So you could create 30 tasks, but on a dual-core machine the TP library would still only create 2 threads and queue the rest. Obviously, this will scale very nicely when you get to run it on a bigger machine. Or you can use something like ThreadPool.QueueUserWorkItem(...) to queue up a bunch of tasks, and the pool will automatically manage how many threads it uses to perform those tasks.
Yes, there is a lot of overhead to thread creation, but if you are using the .NET thread pool (or the parallel task library in 4.0), .NET will be managing your thread creation, and you may actually find it creates fewer threads than the number of tasks you have created. It will internally swap your tasks around on the available threads. If you actually want to control explicit creation of actual threads, you would need to use the Thread class.
[Some CPUs can do clever stuff with threads and can have multiple threads running per CPU (see hyperthreading), but check out your task manager; I'd be very surprised if you have more than 4-8 virtual CPUs on today's desktops.]
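The "many tasks, few threads" behaviour described above is not specific to .NET. As a language-neutral illustration, here is a Python analogue (not the .NET API) in which a fixed-size pool queues 30 tasks behind a small number of worker threads; `run_tasks` is a made-up helper name.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_tasks(tasks, max_workers=None):
    """Run zero-argument callables on a fixed-size pool; results keep submission order."""
    workers = max_workers or os.cpu_count() or 1  # default: one worker per logical CPU
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # All tasks are queued immediately; at most `workers` of them run at a time.
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


if __name__ == "__main__":
    squares = run_tasks([lambda i=i: i * i for i in range(30)], max_workers=2)
    print(len(squares), "tasks completed on 2 worker threads")
```

Because the pool reuses its threads, the cost of thread creation is paid at most `max_workers` times rather than once per task.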
There are so many issues with this that it pays to understand what is happening under the covers. I would highly recommend the "Concurrent Programming on Windows" book by Joe Duffy and the "Java Concurrency in Practice" book. The latter talks about processor architecture at the level you need to understand it when writing multithreaded code. One issue you are going to hit that's going to hurt your code is caching, or more likely the lack of it.
As has been stated there is an overhead to scheduling and running threads, but you may find that there is a larger overhead when you share data across threads. That data may be flushed from the processor cache into main memory, and that will cause serious slow downs to your code.
This is the sort of low-level stuff that managed environments are supposed to protect us from, however, when writing highly parallel code, this is exactly the sort of issue you have to deal with.
A colleague of mine recorded a screencast about the performance issue with Parallel.For and Parallel.ForEach which may help:
http://rocksolidknowledge.com/ScreenCasts.mvc/Watch?video=ParallelLoops.wmv
You're speaking of an ORM, so I presume some amount of I/O is going on. If this is the case, the overhead of thread creation and context switching is going to be comparatively non-existent.
Most likely, you're experiencing I/O contention: it can be slower (particularly on rotational hard drives, but also on other storage devices) to read the same set of data if you read it out of order than if you read it in-order. So, if you're executing 30 database queries, it's possible they'll run faster sequentially than in parallel if they're all backed by the same I/O device and the queries aren't in cache. Running them in parallel may cause the system to have a bunch of I/O read requests almost simultaneously, which may cause the OS to read little bits of each in turn - causing your drive head to jump back and forth, wasting precious milliseconds.
But that's just a guess; it's not possible to really determine what's causing your slowdown without knowing more.
Although thread creation is "extremely expensive" when compared to say adding two numbers, it's not usually something you'll easily overdo. If your operations are extremely short (say, a millisecond or less), using a thread-pool rather than new threads will noticeably save time. Generally though, if your operations are that short, you should reconsider the granularity of parallelism anyhow; perhaps you're better off splitting the computation into bigger chunks: for instance, by having a fairly low number of worker tasks which handle entire batches of smaller work-items at a time rather than each item separately.
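The coarser granularity suggested here, a few worker tasks that each handle a whole batch, might be sketched in Python as follows. The helper names and the default of 4 workers are assumptions for illustration, and a suitable batch size has to be found by measurement.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_batch(batch, func):
    """Handle an entire batch inside one task, so scheduling overhead is paid once per batch."""
    return [func(item) for item in batch]


def map_in_batches(func, items, batch_size, workers=4):
    """Apply `func` to every item, submitting one pool task per batch instead of per item."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda batch: process_batch(batch, func), batches(items, batch_size))
        return [result for part in parts for result in part]
```

For example, `map_in_batches(f, items, batch_size=1000)` over a million short work items submits 1,000 tasks instead of 1,000,000, while the results still come back in the original order.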
