Multiple python calls from bash but no speed-up - python-3.x

I want to run a Python 3 process multiple times with different hyperparameters. To fully utilize the available CPUs, I want to spawn the process multiple times. However, I hardly observe any speed-up in practice. Below I reproduce a small test that illustrates the effect.
First a Python test script:
(speed_test.py)
import numpy as np
import time
now = time.time()
for i in range(50):
    np.matmul(np.random.rand(1000, 1000), np.random.rand(1000, 1000))
print(round(time.time()-now,1))
A single call: python3 speed_test.py prints 10.0 seconds.
However, when I try to run 2 processes in parallel:
python3 speed_test.py & python3 speed_test.py & wait prints 18.6 18.9.
parallel python3 speed_test.py ::: {1..2} prints 18.3 18.7.
It seems as if parallelization hardly buys me anything here (two executions take almost twice the time). I know I can't expect a linear speed-up, but this seems like very little difference. My system has 1 socket with 2 cores per socket and 2 threads per core (4 CPUs in total). I see the same effect on an 8-CPU Google Cloud instance. Roughly, the computation time per process improves by no more than ~10-20% when running in parallel.
Finally, pinning CPUs to processes does not help much either:
taskset -c 0-1 python3 speed_test.py & taskset -c 2-3 python3 speed_test.py & wait prints 17.1 17.8
I thought each Python process could only utilize 1 CPU due to the Global Interpreter Lock. Is there any way to speed up my code?
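(Just to show what I mean by running the process multiple times, I could equally well launch the runs from Python itself; the --lr flag below is only a placeholder hyperparameter, speed_test.py itself takes no arguments:)
# Sketch: one subprocess per hyperparameter setting; --lr is a placeholder flag.
import subprocess

settings = ["0.01", "0.001", "0.0001"]
procs = [subprocess.Popen(["python3", "speed_test.py", "--lr", lr]) for lr in settings]
for p in procs:
    p.wait()  # same effect as the bash `wait`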

Thanks for the reply @TomFenech, I should indeed have added the CPU usage information:
Local (4 vCPU): Single call = ~390%, double call ~190-200% each
Google cluster (8 vCPUs): single call ~400%, double call ~400% each (as expected)
Conclusion of toy example: You are right. When I call htop, I actually see 4 processes per started job, not 1. So the job is internally distributing itself. I think this is related to the fact that the (matrix) multiplication is distributed internally by BLAS/MKL.
Continuation for the true job: So, the above toy example actually turned out to be more involved than expected, and not a perfect proxy for my true script. My true (machine learning) script only partially relies on Numpy (not for matrix multiplication); most of the heavy computation is performed in PyTorch. When I call my script locally (4 vCPU), it uses ~220% CPU. When I call it on the Google Cloud cluster (8 vCPU), it surprisingly gets even up to ~700% (htop indeed shows 7-8 processes). So PyTorch seems to be doing an even better job at distributing itself.
(The Numpy BLAS version can be retrieved with np.__config__.show(). My local Numpy uses OpenBLAS, the Google cluster uses MKL (Conda installation). I can't find a similar command to check the BLAS version of PyTorch, but assume it uses the same.)
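For reference, this is the kind of check I mean for Numpy; whether PyTorch exposes something similar presumably depends on the version, so treat the torch lines below as an assumption that torch.__config__.show() and torch.get_num_threads() are available:
import numpy as np
import torch

np.__config__.show()            # BLAS/LAPACK libraries NumPy was built against
print(torch.__config__.show())  # PyTorch build summary (assumes a recent version)
print(torch.get_num_threads())  # threads PyTorch uses for intra-op work like matmul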
In general, the conclusion seems to be that both Numpy and PyTorch already take care of distributing matrix multiplication over the available CPUs (as long as all CPUs are locally visible, i.e. no cluster/server setting). Therefore, if most of your script is matrix multiplication, there is less reason than (at least I) expected to distribute scripts yourself.
However, not all of my code is matrix multiplication. Therefore, in theory I should still be able to get a speed-up from parallel processes. I wrote a new test, with 50/50 linear and matrix multiplication code:
(speed_test2.py)
import time
import torch
import random
now = time.time()
for i in range(12000):
    [random.random() for k in range(10000)]
print('Linear time', round(time.time() - now, 1))
now = time.time()
for j in range(350):
    torch.matmul(torch.rand(1000, 1000), torch.rand(1000, 1000))
print('Matrix time',round(time.time()-now,1))
Running this on Google Cloud (8 vCPU):
Single process gives Linear time 12.6, Matrix time 9.2. (CPU during first part 100%, second part 500%)
Parallel process python3 speed_test2.py & python3 speed_test2.py gives Linear time 12.6, Matrix time 15.4 for both processes.
Adding a third process gives Linear time ~12.7, Matrix time 25.2
Conclusion: Although there are 8 vCPUs here, the PyTorch/matrix (second) part of the code actually gets slower with more than 2 processes. The linear part of the code does of course still scale: its time stays roughly constant per process, up to 8 parallel processes. I think this altogether explains why, in practice, Numpy/PyTorch code may not show that much improvement when you start multiple concurrent processes, and why it may not always be beneficial to naively start 8 processes when you see 8 vCPUs. Please correct me if I am wrong somewhere here.
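(One thing that follows from this, which I have not benchmarked myself: if you do want to run several processes side by side, you probably need to cap the per-process BLAS/PyTorch thread pools so they do not oversubscribe the CPUs. A rough sketch of what I mean:)
import os
# Cap the thread pools before the heavy libraries spin them up; the variables
# can also be set in the shell, e.g. OMP_NUM_THREADS=1 python3 speed_test2.py
os.environ["OMP_NUM_THREADS"] = "1"  # OpenMP threads (used by OpenBLAS/MKL)
os.environ["MKL_NUM_THREADS"] = "1"  # MKL-specific cap

import torch
torch.set_num_threads(1)             # PyTorch intra-op threads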

Related

Why is `local_rank` zero in DDP even though I set the visible CUDA device to 2?

There are 3 GPUs in my system.
I want to run on the last one, i.e. GPU 2. For this reason, I set gpu_id to 2 in my configuration file as well as CUDA_VISIBLE_DEVICES=2. But in my program, the following lines always assign the 0th GPU.
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
How to fix this issue?
When you set CUDA_VISIBLE_DEVICES=2, you tell CUDA to expose only the third GPU to your process. That is, as far as PyTorch is concerned, there is only one GPU. Therefore torch.distributed.get_world_size() returns 1 (and not 3).
The rank of this GPU in your process will be 0, since there are no other GPUs visible to the process. But as far as the machine is concerned, all processing is done on the third GPU, the one that was allocated to the job.
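A minimal sketch of that renumbering, leaving the distributed setup aside:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"  # must be set before CUDA is initialized

import torch
print(torch.cuda.device_count())  # 1 -- only the masked-in GPU is visible
torch.cuda.set_device(0)          # index 0 now maps to the physical GPU 2
x = torch.ones(3, device="cuda")  # allocated on the physical GPU 2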

Results of Kernel PCA and LLE differ depending on the number of CPU cores provided for the run

I am doing dimensionality reduction using Scikit-Learn's KPCA and sometimes LLE APIs.
I have a dataset with a shape of around (700 x 150), all numerical.
I am just trying to pass this data to one of the above-mentioned APIs to reduce its features. I have written a simple Python script (say run.py) for this, which I can run from the terminal and which also saves the data after reduction.
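A simplified sketch of what run.py does (the file names, kernel and number of components here are only placeholders):
# run.py (simplified sketch; file names, kernel and n_components are placeholders)
import numpy as np
from sklearn.decomposition import KernelPCA

X = np.load("features.npy")  # shape roughly (700, 150)
kpca = KernelPCA(n_components=50, kernel="rbf", n_jobs=2, random_state=42)
X_reduced = kpca.fit_transform(X)
np.save("features_reduced.npy", X_reduced)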
The issue I am facing is this: I am using the "taskset" command in the Linux terminal to assign a certain number of CPUs to a particular run. I can give any number of the CPUs I have on my machine; for example, the terminal command could be:
taskset -c 1-3 python run.py when I want to give 3 cores
or taskset -c 1-2 python run.py when I want to use just 2 cores.
or simply just python run.py when I do not want to specify any CPU.
The problem is that I am getting different results in all three cases. By different results I mean that the output data of these three runs differ from one another, which should not happen since I am using the same script, the same input data, and the same algorithm (either KPCA or LLE) for all three runs. I have also kept the 'n_jobs' parameter at 2, because I am using at least 2 CPUs whenever I use taskset, and I have supplied a random_state. Fortunately, all three results are totally reproducible: the 1st command (with 3 cores) produces the same output data on every run, and the 2nd and 3rd commands likewise produce the same results across their respective runs.
But the question is: why are these outputs different from each other?
Using taskset for my runs is important because I am on a multi-core machine and I need to schedule different CPUs for different tasks; sometimes I have 2, sometimes 3, sometimes n CPUs available for the same task, and I assign them accordingly. But I don't want the results to differ based on how many CPUs I gave, since this also affects the classification performance later in the pipeline.
I have also done some experiments: I don't see this behavior when I use Isomap to reduce my data. The results are the same no matter how many CPUs I give.
I also used "numactl" command in place of "taskset" but the behavior was same.
Surprisingly, I could also see the same behaviour when using the kpca function in R to do the same thing! Is there anything common and fundamental regarding KPCA that I am missing?
Please help.
Thanks,
Pranay
There might be something interesting in understanding exactly how the results differ. Algorithms like LLE, PCA and k-PCA rely on a matrix factorization that has a sign ambiguity (e.g. in PCA, you can negate the component vectors and negate the coefficients and get the "same" answer). I'm not exactly sure what approach is being used for that matrix factorization, what role randomization plays in it, or how it varies when the computation is parallelized, but it doesn't surprise me that the result might differ when the computation is split across more processors, even with the same random seed.
TL;DR: If the results are different just in that some coordinates are negated, that isn't surprising. If they are more different than that, then I don't have a good answer.
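If you want to test the sign-flip hypothesis, something along these lines should do it (a sketch, assuming both outputs are NumPy arrays of shape (n_samples, n_components)):
import numpy as np

def same_up_to_sign(A, B, atol=1e-6):
    # True if every column of A matches the corresponding column of B or its negation
    if A.shape != B.shape:
        return False
    return all(np.allclose(A[:, j], B[:, j], atol=atol) or
               np.allclose(A[:, j], -B[:, j], atol=atol)
               for j in range(A.shape[1]))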

Make focus stacking work faster in Python

I am trying to run tufuse from Python using subprocess.call to merge several layers of images and create one focus-stacked image. The input images are huge, and the job takes 20 min on my PC (12 cores, 64 GB RAM). I want to use multiprocessing, multi-threading or GPU computation to reduce this time, but none of the solutions I tried worked. As far as I understood, these methods work on algebraic functions, not with subprocess.call. Do you have any idea how to make this task run faster?

Will an 8-CPU cloud machine run 8x faster than a 1-CPU machine without changes in the code?

I am a beginner and have no clue yet about cloud computing, multithreading or multiprocessing.
I have a desktop PC with an i7 (4 cores) and I was wondering whether a multi-CPU cloud machine OR an 8+ core machine would run ANY CODE faster than my PC without any changes to the code.
Does the machine handle the distribution of tasks across the several CPUs (or the 8+ cores) by itself, or is it required to adapt the code (multithreading or multiprocessing)?
For the sake of argument, let say I run a simple loop like below:
results = {}
for i in range(10**8):
    results[i] = i**2
This takes about 67 sec on my PC (I was running something else at the same time so I'm not sure this is accurate but my timing is irrelevant anyway).
Would the exact same code be faster on a multi-CPU machine or an 8+ core machine compared to a single-CPU 4-core machine?
If it is, in fact, required to make changes, I would appreciate any beginner links to learn about multiprocess or multithread.
Thank you for your help.
I'm no expert, but I think it really depends on the platform you're using to write and run your code. Some languages may support multi-threading/multi-processing natively, and in such cases the code will run faster, but others might not.
One thing is for certain: you can't say that in 100% of cases a machine with more cores/CPUs will run a given piece of code faster than a machine with fewer cores/CPUs.
Hope I helped clear things up.
Edit:
This Medium post regarding multiprocessing/multithreading in Python looks good: Multithreading vs Multiprocessing in Python 🐍
Python multiprocessing for dummies
Code will run faster only when it is written in a parallel fashion. The snippet in your question is not written to run in parallel, so it won't run any faster.
When a parallel program is being written, the programmer keeps in mind the target level of parallelization. A sequential program has a parallelization level of 1. A program with N CPU-intensive threads runs most effectively on N processors (cores). A program with a high level of parallelization may execute more slowly on a 2-4 core machine than its sequential variant.
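For example, a parallel rewrite of that loop could look roughly like the sketch below. For work as cheap as i**2 the inter-process overhead can easily eat up the gain, so this only shows the shape of the change, not a guaranteed speed-up:
from multiprocessing import Pool

def square(i):
    return i, i ** 2

if __name__ == "__main__":
    with Pool() as pool:  # one worker per CPU core by default
        # range reduced to 10**6 here just to keep the example light
        results = dict(pool.imap_unordered(square, range(10**6), chunksize=10000))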

python: will memory_profiler affect runtime?

I am evaluating tools that profile my Python program. One of the interesting tools here is memory_profiler. Before moving forward, I just want to know whether memory_profiler affects runtime. The reason I am asking is that memory_profiler outputs a lot of memory-usage information, so I suspect it might affect runtime.
Thanks
Derek
It depends on how you are using memory_profiler. It can be used in two different ways:
To get memory usage line by line (run with python -m memory_profiler my_script.py; see the sketch below). This needs to query memory information (from the OS) for every line executed within the profiled function. How this affects the run time depends on the number of lines in the function: if it has a lot of lines with fast execution times, it may add significant overhead. On the other hand, if the function to profile has few lines and each line has a significant computing time, then the overhead will be negligible.
To get memory as a function of time (run with mprof run my_script.py and plot with mprof plot). In this case the function that collects the memory usage runs in a different process from the one that runs your script, hence the overhead is minimal (unless you are using all CPUs).
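For reference, the line-by-line mode from the first point looks roughly like this (the function body is just a toy example):
# my_script.py -- toy example; run with: python -m memory_profiler my_script.py
from memory_profiler import profile

@profile
def build_lists():
    a = [1] * (10 ** 6)      # roughly 8 MB of list storage
    b = [2] * (2 * 10 ** 7)  # roughly 160 MB
    del b
    return a

if __name__ == "__main__":
    build_lists()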
