Make focus stacking work faster in Python - multithreading

I am trying to run tufuse from Python using subprocess.call to merge several layers of images and create one focus stack image. The input images are huge, and the job takes 20 min on my PC (12 cores, 64 GB RAM). I want to use multiprocessing, multi-threading or GPU computation to reduce this time. However, none of the solutions I tried worked. As far as I understand, these methods work on algebraic functions, not with subprocess.call. Do you have any idea how to make this task run faster?
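A minimal sketch of one option, assuming the work can be split into several independent stacks (a single tufuse invocation can only be as fast as tufuse itself). subprocess.call blocks until the external program exits, so a thread pool can keep several invocations running at once. The commands below are placeholders; the real tufuse arguments are not shown in the question and are not assumed here.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_workers=4):
    """Run each command (a list of strings) as its own external process.

    Threads are sufficient here: each one only waits on an external
    program, so the GIL is not a bottleneck. Returns the exit codes
    in the same order as `commands`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(subprocess.call, commands))


# Placeholder commands standing in for one tufuse call per stack
# (hypothetical; substitute the real command line for each stack).
demo_commands = [[sys.executable, "-c", "pass"] for _ in range(3)]
exit_codes = run_all(demo_commands, max_workers=3)
```

With 12 cores, max_workers would typically be chosen so that the concurrent jobs together fit in RAM, since each job loads its own set of large images.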

Related

Results of Kernel PCA and LLE are different for different numbers of CPU cores provided for the run

I am doing dimensionality reduction using scikit-learn's KPCA and sometimes LLE APIs.
I have a dataset with a shape of around (700 x 150), all numerical.
I am just trying to pass this data to one of the above-mentioned APIs to reduce its features. I have written a simple Python script (say run.py) for it, which I can run from the terminal and which also saves the data after reduction.
The issue I am facing is this: I am using the "taskset" command in the Linux terminal to assign a certain number of CPUs to a particular run. I can give any number of CPUs up to what my machine has; for example, the terminal command could be:
taskset -c 1-3 python run.py when I want to give 3 cores
or taskset -c 1-2 python run.py when I want to use just 2 cores.
or simply just python run.py when I do not want to specify any CPU.
The problem is that I get different results in all three cases: the output data of these three runs differ from one another. This should not happen, since I am using the same script, the same input data, and the same algorithm (either KPCA or LLE) for all three runs. I have also set the 'n_jobs' parameter to 2, because I am using at least 2 CPUs whenever I use taskset, and I have supplied a random_state. Fortunately, all three results are fully reproducible: the 1st command (with 3 cores) produces the same output data on every run, and the 2nd and 3rd commands likewise produce the same results across repeated runs.
But the question is: why are these outputs different from each other?
Setting up taskset in my run is important for me because I am using a multi-core machine and I need to schedule different CPUs for different tasks. Sometimes I have 2, sometimes 3, sometimes n CPUs for the same task, which I assign accordingly, but I don't want the results to differ based on how many CPUs I gave. This also affects my classification performance, which comes later in the pipeline.
Also, after some experiments, I don't see this behavior when I use Isomap to reduce my data. The results are the same no matter how many CPUs I give.
I also used the "numactl" command in place of "taskset", but the behavior was the same.
Surprisingly, we see the same behaviour when using the kpca function in R to do the same thing! Is there anything common and fundamental about KPCA that I am missing?
Please help.
Thanks,
Pranay
There might be something interesting in understanding exactly how the results differ. Algorithms like LLE, PCA and k-PCA rely on a matrix factorization that has a sign ambiguity (e.g. in PCA, you can negate the component vectors and negate the coefficients and have the "same" answer). I'm not sure exactly what approach is being used for that matrix factorization, what role randomization plays in it, or how it varies when it is parallelized, but it doesn't surprise me that the result might be different when the computation is split across more processors, even with the same random seed.
TL;DR: If the results are different just in that some coordinates are negated, that isn't surprising. If they are more different than that, then I don't have a good answer.
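One way to check whether two runs differ only by negated coordinates is to put both outputs into a common sign convention before comparing them. The sketch below is an illustration, not what scikit-learn does internally: normalize_signs is a hypothetical helper, and the two "runs" are simulated with random data rather than real KPCA output.

```python
import numpy as np


def normalize_signs(components):
    """Flip each column so its largest-magnitude entry is positive.

    `components` is an (n_samples, n_components) array, such as the
    output of a KPCA or LLE transform.
    """
    components = np.asarray(components, dtype=float)
    # Row index of the largest |value| in each column.
    pivot_rows = np.argmax(np.abs(components), axis=0)
    pivots = components[pivot_rows, np.arange(components.shape[1])]
    signs = np.where(pivots < 0, -1.0, 1.0)
    return components * signs


rng = np.random.default_rng(0)
run_a = rng.normal(size=(700, 5))
# Simulate a second run that differs only by negated columns.
run_b = run_a * np.array([1.0, -1.0, 1.0, -1.0, -1.0])

same_up_to_sign = np.allclose(normalize_signs(run_a), normalize_signs(run_b))
```

If the real outputs from the 2-core and 3-core runs compare equal after this step, the difference is only the sign ambiguity; if not, something else is going on.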

Will an 8-CPU Cloud Machine run 8x faster than a 1-CPU CM without changes in the code?

I am a beginner and I have no clue, yet, about cloud computing nor multithreading nor multiprocessing.
I have a desktop PC with an i7 (4 cores) and I was wondering if a multi-CPUs cloud machine OR an 8+ cores machine would run ANY CODE faster than my PC without any changes in the code.
Does the machine handle the tasks distribution on the several CPUs (or the 8+ cores) by itself or is it required to adapt the code? (multithreading or multiprocessing)
For the sake of argument, let's say I run a simple loop like the one below:
results = {}
for i in range(10**8):
    results[i] = i**2
This takes about 67 sec on my PC (I was running something else at the same time so I'm not sure this is accurate but my timing is irrelevant anyway).
Would the exact same code be faster on a multi-CPU machine or an 8+ core machine compared to a single-CPU 4-core machine?
If it is, in fact, required to make changes, I would appreciate any beginner links to learn about multiprocess or multithread.
Thank you for your help.
I'm no expert, but I think it really depends on the platform that you're using to write and run your code. Some languages may support multi-threading/multi-processing natively, and as such the code will run faster, but others might not.
One thing is for certain: you can't say that in 100% of cases a machine with more cores/CPUs will run a given piece of code faster than a machine with fewer cores/CPUs.
Hope I helped clear things up.
Edit:
This Medium post regarding multiprocessing/multithreading in Python looks good - Multithreading vs Multiprocessing in Python 🐍
Python multiprocessing for dummies
Code will run faster only when it is written in a parallel fashion. The snippet in your question is not written to run in parallel, so it won't run any faster.
When a parallel program is written, the programmer keeps in mind the target level of parallelization. A sequential program has parallelization level = 1. A program with N CPU-intensive threads would run most effectively on N processors (cores). A program with a high parallelization level may execute slower on a 2-4 core machine than the sequential variant.
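As a rough illustration of what "written in a parallel fashion" could look like for the loop in the question, here is a sketch that splits the range into chunks and hands each chunk to a worker process. The helper names (make_chunks, square_chunk, parallel_squares) are made up for this example. Note that sending large dicts back from the workers has its own cost, so the speed-up for this particular loop may be modest.

```python
from concurrent.futures import ProcessPoolExecutor


def make_chunks(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) pairs."""
    step, extra = divmod(n, parts)
    chunks, start = [], 0
    for k in range(parts):
        # The first `extra` chunks get one additional element.
        stop = start + step + (1 if k < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def square_chunk(bounds):
    """Compute i**2 for every i in one chunk; runs in a worker process."""
    start, stop = bounds
    return {i: i ** 2 for i in range(start, stop)}


def parallel_squares(n, workers=4):
    """Build the same dict as the sequential loop, one chunk per worker."""
    results = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(square_chunk, make_chunks(n, workers)):
            results.update(partial)
    return results
```

In a script, the call (for example parallel_squares(10**6, workers=4)) should sit under an if __name__ == "__main__": guard, so that worker processes can import the module without re-running it.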

Multiple python calls from bash but no speed-up

I want to run a Python3 process multiple times with different hyperparameters. To fully utilize the available CPU's, I want to spawn the process multiple times. However, I hardly observe any speed-up in practice. Below I will reproduce a small test that illustrates the effect.
First a Python test script:
(speed_test.py)
import numpy as np
import time
now = time.time()
for i in range(50):
    np.matmul(np.random.rand(1000,1000),np.random.rand(1000,1000))
print(round(time.time()-now,1))
A single call: python3 speed_test.py prints 10.0 seconds.
However, when I try to run 2 processes in parallel:
python3 speed_test.py & python3 speed_test.py & wait prints 18.6 18.9.
parallel python3 speed_test.py ::: {1..2} prints 18.3 18.7.
It seems as if parallelization hardly buys me anything here (two executions in almost twice the time). I know I can't expect a linear speed-up, but this seems to be very little difference. My system has 1 socket with 2 cores per socket and 2 threads per core (4 CPUs in total). I see the same effect on an 8 CPU Google Cloud instance. Roughly, the computational time improves by no more than ~10-20% per process when running in parallel.
Finally, pinning CPUs to processes does not help much either:
taskset -c 0-1 python3 speed_test.py & taskset -c 2-3 python3 speed_test.py & wait prints 17.1 17.8
I thought each Python process could only utilize 1 CPU due to the Global Interpreter Lock. Is there any way to speed up my code?
Thanks for the reply @TomFenech, I should indeed have added the CPU usage information:
Local (4 vCPU): Single call = ~390%, double call ~190-200% each
Google cluster (8 vCPUs): single call ~400%, double call ~400% each (as expected)
Conclusion of toy-example: You are right. When I call htop, I actually see 4 processes per started job, not 1. So the job is internally distributing itself. I think this is related, distributing happens for (matrix) multiplication by BLAS/MKL.
Continuation for true job: The above toy example was actually more involved and not a perfect stand-in for my true script. My true (machine learning) script only partially relies on Numpy (not for matrix multiplication); most of the heavy computation is performed in PyTorch. When I call my script locally (4 vCPU), it uses ~220% CPU. When I call that script on the Google Cloud cluster (8 vCPU), it, surprisingly, gets even up to ~700% (htop indeed shows 7-8 processes). So PyTorch seems to be doing an even better job at distributing itself.
(The Numpy BLAS version can be retrieved with np.__config__.show(). My local Numpy uses OpenBlas, the Google cluster uses MKL (Conda installation). I can't find a similar command to check for the BLAS version of PyTorch, but assume it uses the same.)
In general, the conclusion seems that both Numpy and PyTorch itself already take care of distributing code when it comes to matrix multiplication (and all CPUs are locally visible, i.e. no cluster/server setting). Therefore, if most of your script is matrix multiplication, then there is less reason than (at least I) expected to distribute scripts yourself.
However, not all of my code is matrix multiplication. Therefore, in theory I should still be able to get a speed-up from parallel processes. I wrote a new test, with 50/50 linear and matrix multiplication code:
(speed_test2.py)
import time
import torch
import random
now = time.time()
for i in range(12000):
    [random.random() for k in range(10000)]
print('Linear time',round(time.time()-now,1))
now = time.time()
for j in range(350):
    torch.matmul(torch.rand(1000,1000),torch.rand(1000,1000))
print('Matrix time',round(time.time()-now,1))
Running this on Google Cloud (8 vCPU):
Single process gives Linear time 12.6, Matrix time 9.2. (CPU during first part 100%, second part 500%)
Parallel process python3 speed_test2.py & python3 speed_test2.py gives Linear time 12.6, Matrix time 15.4 for both processes.
Adding a third process gives Linear time ~12.7, Matrix time 25.2
Conclusion: Although there are 8 vCPUs here, the PyTorch/matrix (second) part of the code actually gets slower with more than 2 processes. The linear part of the code does of course scale (up to 8 parallel processes). I think this altogether explains why, in practice, Numpy/PyTorch code may not show much improvement when you start multiple concurrent processes, and why it may not always be beneficial to naively start 8 processes when you see 8 vCPUs. Please correct me if I am wrong somewhere here.
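A possible way to keep concurrent processes from competing for the same cores is to cap the number of library threads per process. The sketch below assumes the math backends in use honor the usual OMP_NUM_THREADS, OPENBLAS_NUM_THREADS and MKL_NUM_THREADS environment variables; capped_env and launch are hypothetical helpers, and whether this actually helps would need to be measured on the machine in question.

```python
import os
import subprocess
import sys

# Environment variables read by common threaded math backends
# (OpenMP, OpenBLAS, MKL) when the library is loaded.
THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def capped_env(threads_per_process):
    """Return a copy of the current environment with thread caps set."""
    env = dict(os.environ)
    for name in THREAD_VARS:
        env[name] = str(threads_per_process)
    return env


def launch(script, n_processes, total_cpus=None):
    """Start `n_processes` copies of `script`, splitting the CPUs evenly."""
    total_cpus = total_cpus or os.cpu_count() or 1
    threads = max(1, total_cpus // n_processes)
    env = capped_env(threads)
    procs = [subprocess.Popen([sys.executable, script], env=env)
             for _ in range(n_processes)]
    # Wait for every process and collect the exit codes.
    return [p.wait() for p in procs]
```

For example, launch("speed_test2.py", n_processes=2) on the 8 vCPU machine would give each process 4 threads. Inside a PyTorch script, torch.set_num_threads is an in-process alternative.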

How to use Rmpi in R on linux Cluster to increase cores available with DEoptim?

I am using code developed in R to calibrate a hydrological model with 8 parameters using DEoptim (a function that aims to minimise an objective function). The DEoptim code uses the 'parallel' package to detect the number of cores available using 'detectCores()'. On my PC I have 4 cores with 2 threads each, so it detects 8 cores. It then sends the hydrological model out to a core with different parameter values, and the results are returned to the centre. It does this hundreds or thousands of times and iterates the parameters to try to find an optimum set. Therefore, the more cores available, the faster it will work.
I am at a university and have access to a Linux compute cluster. They have servers with up to 12 cores (i.e. not threads), and if I used one of these it would work two to three times faster than my PC. Great. However, ideally I would spread the code across other servers so I could have access to more cores, with all the info sent back to the master.
Therefore, my question is how could I include Rmpi in my code to effectively increase the cores available. As you can probably tell, I am quite new to using clusters.
Many thanks, Antony
If you want to execute DEoptim on multiple nodes of a Linux cluster, I believe you'll need to use foreach by specifying parallelType=2 in the control argument. You can use either the doMPI parallel backend or the doParallel backend with an MPI cluster object. For example:
library(doParallel)
library(Rmpi)
cl <- makeCluster(mpi.universe.size()-1, type='MPI')
registerDoParallel(cl)
# and eventually...
DEoptim(fn=Genrose, lower=rep(-25, n), upper=rep(25, n),
control=list(NP=10*n, itermax=maxIt, parallelType=2))
You'll need to have the snow package installed in addition to the others. Also, make sure that you execute your script with mpirun using the -np 1 option. If you don't use mpirun, the workers will all be spawned on the local machine.

Matlab 2011a Use all Cores Available on 64 bit Linux?

Hi, I've looked online but I can't seem to find out whether I need to do anything to make MATLAB use all cores. From what I understand, multi-threading has been supported since 2007. On my machine MATLAB only uses one core at 100% and the rest hang at ~2%. I'm using 64-bit Linux (Mint 12). On my other computer, which has only 2 cores and is 32-bit, MATLAB seems to utilize both cores at 100%. Not all of the time, but in a sufficient number of cases. On the 64-bit, 4-core PC this never happens.
Do I have to do anything in 64 bit to get Matlab to use all the cores whenever possible? I had to do some custom linking after install as Matlab wasn't finding the libraries (eg. libc.so.6) because it wasn't looking in the correct places.
By standard, since the latest release, you can use 12 cores using the Parallel Computing Toolbox. Without this toolbox, I guess you're out of luck. Any additional cores could be accessed by the MATLAB Distributed Computing Server, where you actually pay per number of worker threads.
To make matlab use your multiple cores you have to do
matlabpool open
And it of course works better if you actually have multithreaded code (like using the spmd function or parfor loops)
More info at the Matlab homepage
MATLAB has only one single thread for Computation.
That said, multiple threads would be created for certain functions which use the multithreaded features of the BLAS libraries that it uses underneath.
Thus, you would only be able to gain a 'multi-threaded' advantage if you are calling functions which use these multi-threaded BLAS libraries.
This link has information on the list of functions which are multithreaded.
Now for the use of your cores, that would depend on your OS. I believe the OS would have to load balance your threads to be used on all cores. One CANNOT set affinities to threads from within MATLAB. One can however set worker MATLAB processes to have affinities to cores from within the Parallel Computing toolbox.
However, you could always try setting the affinity for the MATLAB process to all your processors manually by the details available at the following link for Linux
Windows users can simply right click on the process in the task manager and set affinity.
My understanding is that this is only a request to the OS and is not a hard binding rule that the OS must adhere to.

Resources