When calling GEOS functions on GEOSGeometry objects in multiple threads I'm seeing only about a 2x speedup over sequential execution.
For my test computation I create a GEOSContextHandle_t and then create many GEOSGeometry objects. Then, in many threads, I call a reentrant function such as GEOSBufferWithStyle_r in a for loop over all of the polygons.
This is only about 2x faster than sequential operation on a four-core machine (i5). Is there anything I should be doing to improve parallelism?
I am doing this within Cython, properly wrapping code in nogil blocks.
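For reference, here is a rough C++ approximation of the setup (the actual code is Cython under nogil; the buffer parameters, the work partitioning, and the one-context-per-thread arrangement are illustrative assumptions, since a single context handle must not be used by two threads at once):

```cpp
#include <geos_c.h>
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Buffer one slice of the geometries through the reentrant API.
void buffer_range(GEOSContextHandle_t ctx,
                  const std::vector<GEOSGeometry*>& geoms,
                  std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        GEOSGeometry* out = GEOSBufferWithStyle_r(
            ctx, geoms[i], /*width=*/1.0, /*quadsegs=*/8,
            GEOSBUF_CAP_ROUND, GEOSBUF_JOIN_ROUND, /*mitreLimit=*/5.0);
        if (out) GEOSGeom_destroy_r(ctx, out);
    }
}

int main() {
    const std::size_t nthreads = 4;

    std::vector<GEOSGeometry*> geoms;
    // ... populate geoms, e.g. via GEOSWKTReader_read_r (omitted) ...

    // One context handle per thread; a handle is not safe to share
    // between threads running concurrently.
    std::vector<GEOSContextHandle_t> ctx(nthreads);
    for (auto& c : ctx) c = GEOS_init_r();

    std::vector<std::thread> threads;
    const std::size_t chunk = (geoms.size() + nthreads - 1) / nthreads;
    for (std::size_t t = 0; t < nthreads; ++t) {
        const std::size_t b = t * chunk;
        const std::size_t e = std::min(geoms.size(), b + chunk);
        threads.emplace_back(buffer_range, ctx[t], std::cref(geoms), b, e);
    }
    for (auto& th : threads) th.join();
    for (auto& c : ctx) GEOS_finish_r(c);
}
```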
From Wikipedia, the paragraph Comparison with threads states:
... This means that coroutines provide concurrency but not parallelism ...
I understand that a coroutine is lighter-weight than a thread: no context switching is involved, and there are no critical sections, so no mutex is needed either. What confuses me is that the way it works does not seem to scale. According to Wikipedia, coroutines provide concurrency and work cooperatively. A program with coroutines still executes instructions sequentially. That is exactly the same as threads on a single-core machine, but what about multicore machines, on which threads run in parallel while coroutines work the same as on a single core?
My question is: how can coroutines perform better than threads on multicore machines?
...what about multicore machines?...
Coroutines are a model of concurrency (in which two or more stateful activities can be in progress at the same time), but not a model of parallelism (in which the program would be able to use more hardware resources than a single, conventional CPU core can provide).
Threads can run independently of one another, and if your hardware supports it (i.e., if your machine has more than one core) then two or more threads can be performing their independent activities at the same instant in time.
But coroutines, by definition, are interdependent. A coroutine only runs when it is called by another coroutine, and the caller is suspended until the current coroutine calls it back. Only one coroutine from a set of coroutines can ever be actually running at any given instant in time.
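To make that concrete, here is a minimal C++20 generator sketch (illustrative only; the question is language-agnostic). The caller and the coroutine strictly take turns; at no instant are both running:

```cpp
#include <coroutine>
#include <cstdio>
#include <exception>

// Minimal int generator built on C++20 coroutines.
struct Generator {
    struct promise_type {
        int value;
        Generator get_return_object() {
            return Generator{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        std::suspend_always yield_value(int v) { value = v; return {}; }
        void return_void() {}
        void unhandled_exception() { std::terminate(); }
    };
    std::coroutine_handle<promise_type> h;
    ~Generator() { if (h) h.destroy(); }
    bool next() { h.resume(); return !h.done(); }  // caller resumes the coroutine...
    int value() const { return h.promise().value; }
};

Generator counter(int limit) {
    for (int i = 0; i < limit; ++i)
        co_yield i;  // ...and the coroutine suspends, handing control back
}

int main() {
    Generator g = counter(3);
    while (g.next())                     // we are suspended while g runs
        std::printf("%d\n", g.value());  // g is suspended while we run
}
```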
First, I create four threads, and each of them calls a GPU function. However, within each of the four I also want to create two more threads: one to read data from the disk and the other to do the computation. I am not sure whether I can create nested threads in C++, and I suspect this is not clean code. Is there another way to solve the problem?
In general it should be no problem to create a new thread from a running thread.
As you already suspect, it's not the best solution: creating and destroying threads frequently isn't cheap, and the more threads you have, the more context switches occur, which is (or can be) a performance penalty as well.
So you could create a thread pool with a given number of threads and let the pool's threads handle reading data from disk and doing the computations. That way you avoid the massive creation and destruction of threads.
If you also frequently create and destroy the threads that call the GPU functions, you could create two thread pools: one for the threads calling the GPU functions and one for reading from disk and doing the computations.
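A minimal sketch of such a pool (illustrative, not production code; real code would also want exception handling and a way to wait for results):

```cpp
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Fixed-size thread pool: threads are created once and reused.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool() {  // drains the queue, then joins all workers
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // done_ is set and nothing is left
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run the task outside the lock
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

int main() {
    ThreadPool io_pool(2);       // e.g. for disk reads
    ThreadPool compute_pool(4);  // e.g. for computations / GPU calls
    for (int i = 0; i < 8; ++i)
        compute_pool.submit([i] { std::printf("working on item %d\n", i); });
}   // destructors drain the queues and join the threads
```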
You could use std::async and do away with thread management entirely. Or use a hybrid approach where you keep the four core threads, which I assume will never die, and then use std::async in those functions where you wish to perform more asynchronous work.
https://solarianprogrammer.com/2012/10/17/cpp-11-async-tutorial/
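A minimal sketch of the std::async variant; load_chunk and compute are hypothetical stand-ins for the real disk read and the real computation:

```cpp
#include <cstdio>
#include <future>
#include <vector>

double load_chunk(int id)   { return id * 1.0; }   // stand-in for a disk read
double compute(double data) { return data * 2.0; } // stand-in for the computation

int main() {
    std::vector<std::future<double>> results;
    for (int id = 0; id < 4; ++id) {
        // std::launch::async requests a dedicated thread per task; without
        // the policy, the implementation is free to defer execution.
        results.push_back(std::async(std::launch::async, [id] {
            return compute(load_chunk(id));
        }));
    }
    double total = 0.0;
    for (auto& f : results)
        total += f.get();  // blocks until the corresponding task is finished
    std::printf("total = %f\n", total);
}
```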
It's not clear whether std::async tasks use a thread pool. If you want to ensure high performance, which you probably care about since you're using GPUs, you should use a thread pool.
http://roar11.com/2016/01/a-platform-independent-thread-pool-using-c14/
I am modelling and solving a nonlinear program (NLP) using single-threaded CPLEX with AMPL (I explicitly constrain CPLEX to use only one thread) on CentOS 7. I am using a processor with six independent cores (Intel i7-8700) to solve six independent test instances.
When I run these tests sequentially, it is much faster (by about 63% in elapsed time) than when I run the six instances concurrently. They execute as independent processes, reading distinct data files and writing results to distinct output files. I have also tried solving these tests sequentially with multiple threads, and I got times similar to the single-threaded sequential runs.
I have checked the behaviour of these processes using top/htop, and they are scheduled on different cores. So my question is: how can running these tests concurrently have such an impact on elapsed time if they are solving on different cores, each with only one thread, as separate processes?
Any thoughts would be appreciated.
It's very easy to make many threads perform worse than a single thread. The key to successful multi-threading and speedup is to understand not just the fact that the program is multi-threaded, but to know exactly how your threads interact. Here are a few questions you should ask yourself as you review your code:
1) Do the individual threads share resources? If so, what are those resources, and does accessing them block the other threads?
2) What's the slowest resource your multi-threaded code relies on? A common (and often neglected) bottleneck is disk IO. Multiple threads can process data much faster, but they won't make a disk read any faster, and in many cases multithreading can make it much worse (e.g. thrashing).
3) Is access to common resources properly synchronized?
To this end, and without knowing more about your problem, I'd recommend:
a) Not reading different files from different threads. You want to keep your disk IO as sequential as possible, and this is easier from a single thread. Maybe batch-read files from a single thread and then farm them out for processing (see the sketch after this list).
b) Keeping your threads as autonomous as possible: any communication back and forth will cause thread contention and slow things down.
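A sketch of recommendation (a): a single reader thread keeps the disk IO sequential, and worker threads do only CPU-bound processing (the "file read" here is a placeholder, and the globals are for brevity):

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

std::queue<std::string> queue_;
std::mutex m_;
std::condition_variable cv_;
bool done_ = false;

// Single reader: the only thread touching the disk.
void reader(const std::vector<std::string>& files) {
    for (const auto& f : files) {
        std::string data = "contents of " + f;  // stand-in for an actual file read
        {
            std::lock_guard<std::mutex> lock(m_);
            queue_.push(std::move(data));
        }
        cv_.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m_);
        done_ = true;
    }
    cv_.notify_all();
}

// Workers: CPU-bound processing only, no disk access.
void worker(int id) {
    for (;;) {
        std::string data;
        {
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [] { return done_ || !queue_.empty(); });
            if (queue_.empty()) return;  // done_ is set and nothing is left
            data = std::move(queue_.front());
            queue_.pop();
        }
        std::printf("worker %d processing: %s\n", id, data.c_str());
    }
}

int main() {
    std::vector<std::string> files = {"a.dat", "b.dat", "c.dat", "d.dat"};
    std::thread r(reader, std::cref(files));
    std::vector<std::thread> workers;
    for (int i = 0; i < 3; ++i) workers.emplace_back(worker, i);
    r.join();
    for (auto& w : workers) w.join();
}
```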
Are goroutines roughly equivalent to Python's asyncio tasks, with the additional feature that any CPU-bound task is routed to a ThreadPoolExecutor instead of being added to the event loop (assuming, of course, a Python interpreter without a GIL)?
Is there any substantial difference between the two approaches that I'm missing? Of course, apart from the efficiencies and code clarity that result from the concurrency being an integral part of Go.
I think I know part of the answer. I tried to summarize my understanding of the differences, in order of importance, between asyncio tasks and goroutines:
1) Unlike under asyncio, one rarely needs to worry that a goroutine will block for too long. OTOH, memory sharing across goroutines is akin to memory sharing across threads rather than asyncio tasks, since goroutine execution-order guarantees are much weaker (even if the hardware has only a single core).
asyncio will only switch context on explicit await, yield and certain event loop methods, while Go runtime may switch on far more subtle triggers (such as certain function calls). So asyncio is perfectly cooperative, while goroutines are only mostly cooperative (and the roadmap suggests they will become even less cooperative over time).
A really tight loop (such as numeric computation) could still block the Go runtime (well, the thread it's running on). If that happens, it has less of an impact than in Python, unless it occurs in multiple threads.
2) Goroutines have off-the-shelf support for parallel computation, which would require a more sophisticated approach under asyncio.
The Go runtime can run threads in parallel (if multiple cores are available), so it's somewhat similar to running multiple asyncio event loops in a thread pool under a GIL-less Python runtime, with a language-aware load balancer in front.
3) The Go runtime will automatically handle blocking syscalls on a separate thread; under asyncio this needs to be done explicitly (e.g., using run_in_executor).
That said, in terms of memory cost, goroutines are very much like asyncio tasks rather than threads.
I suppose you could think of it as working that way underneath, sure. It's not really accurate, but close enough.
But there is a big difference: in Go you can write straight line code, and all the I/O blocking is handled for you automatically. You can call Read, then Write, then Read, in simple straight line code. With Python asyncio, as I understand it, you need to queue up a function to handle the reads, rather than just calling Read.
The term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big performance hit.
I was wondering, is there a similar penalty for doing this in OpenMP? For example, say I have a six-core processor and a program with six threads. If I have a conditional that makes three threads perform a certain task, and then have the other three perform a completely different task, will there be a big performance hit? I guess in essence it's sort of using OpenMP to do MIMD.
Basically, I'm writing a program with OpenMP and CUDA. I want two threads to run a CUDA kernel while the remaining threads run C code. Thanks.
No, there is no performance hit for diverging threads using OpenMP. It is a problem in CUDA because of the way instructions are broadcast simultaneously to a set of cores. When an OpenMP thread targets a CPU core, each CPU core has its own independent set of instructions to follow, and it runs just like any other single-threaded program would.
You may see some of your cores being underutilized if you have synchronization barriers following thread divergence, because that would force faster threads to wait for the slower threads to catch up.
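A minimal OpenMP sketch of the pattern from the question; the GPU and CPU branches are placeholders. The threads diverge with no intrinsic penalty, and the only waiting happens at the implicit barrier that closes the parallel region:

```cpp
#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel num_threads(6)
    {
        int tid = omp_get_thread_num();
        if (tid < 2) {
            // e.g. launch a CUDA kernel and wait for it (placeholder)
            std::printf("thread %d: GPU work\n", tid);
        } else {
            // independent CPU computation (placeholder)
            std::printf("thread %d: CPU work\n", tid);
        }
    }   // implicit barrier: faster threads wait here for slower ones
}
```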
When speaking about CPU parallelism, there's no intrinsic performance hit from using a certain threading design pattern. Not at the theoretical level at least.
The only problem I see is that since the threads are doing different things which may have varying completion times, some of the threads may sit idle after finishing their work, waiting for the others to finish a longer task.
The term thread divergence in CUDA refers to the situation where not all threads of a block evaluate a conditional with the same outcome. Such threads are said to diverge. If diverging threads are in the same warp, they may have to perform their work serially, which leads to performance loss.
I am not sure that OpenMP has the same issue, though. When different threads perform different work, the runtime may have to do some load balancing, but that doesn't necessarily lead to serialization of the work.
There is no such problem in OpenMP, because every OpenMP thread has its own program counter (PC).