why `local_rank` is zero in DDP even I set visible CUDA as 2? - pytorch

There are 3 GPUs in my system.
I want to run on the last one i.e. 2. For this reason, I set gpu_id as 2 in my configuration file as well as CUDA_VISIBLE_DEVICES=2. But in my program, the following line always assigns the 0th GPU.
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
How to fix this issue?

When setting CUDA_VISIBLE_DEVICES=2 you tell the OS to only expose the third GPU to your process. That is, as far as PyTorch is concerned, there is only one GPU. Therefore torch.distributed.get_world_size() returns 1 (and not 3).
The rank of this GPU, in your process, will be 0 - since there are no other GPUs available for the process. But as far as the OS is concerned - all processing are done on the third GPU that was allocated to the job.

Related

Use PyTorch DistributedDataParallel with Hugging Face on Amazon SageMaker

Even for single-instance training, PyTorch DistributedDataParallel (DDP) is generally recommended over PyTorch DataParallel (DP) because DP's strategy is less performant and it uses more memory on the default device. (Per this PyTorch forums thread)
Hugging Face recommend to run distributed training via the python -m torch.distributed.launch launcher, because their Trainer API supports DDP but will fall back to DP if you don't. (Per this HF forums thread)
I recently ran in to this problem: scaling a HF training job from p3.8xlarge to p3.16xlarge increased memory consumption on (I think) one of the GPUs to the point where I had to significantly reduce batch size to avoid CUDA Out of Memory errors - basically losing all scaling advantage.
So the good news is for p3.16xl+ I can just enable SageMaker Distributed Data Parallel and the PyToch DLC will automatically launch via torch.distributed for me.
The bad news for use cases with smaller workloads or wanting to test before they scale up, is that SMDistributed doesn't support all multi-GPU instance types. No p3.8xl or g series, for example. I did try manually setting the sagemaker_distributed_dataparallel_enabled environment variable, but no joy.
So how else can we launch HF Trainer scripts with PyTorch DDP on SageMaker?
Great question, thanks for asking! PyTorch DDP runs data parallel workers in multiple processes, that must be launched and managed by developers. DDP should be seen as a managed allreduce, more than a managed data-parallelism library, since it requires you to launch and manage the workers and even assigning resources to workers. In order to launch the DDP processes in a SageMaker Training job you have many options:
If you do multi-GPU, single-machine, you can use torch.multiprocessing.spawn, as shown in this official PyTorch demo (that is broken by the way)
If you do multi-GPU, single-machine, you can also use the Ray Train library to launch those processes. I was able to use it in a Notebook, but not in the DLC yet (recent library that is a bit rough to learn and make work, see all my issues here). Ray Train should work on multi-node too.
If you do multi-GPU, any-machine, you can use torch.distributed.launch, wrapped in a launcher script in shell or Python. Example here https://gitlab.aws.dev/cruchant/a2d2-segmentation/-/blob/main/3_2D-Seg-Audi-A2D2-Distributed-Training-DDP.ipynb
You can also launch those processes with the SageMaker MPI integration instead of torch.distributed. Unfortunately, we didn't create documentation for this, so no one uses it nor pitches it. But it looks cool, because it allows to run copies of your script directly in the EC2 machines without the need to invoke an intermediary PyTorch launcher. Example here
So for now, my recommendation would be to go the route (3), which is the closest to what the PyTorch community does, so provides easier development and debugging path.
Notes:
PyTorch DDP evolves fast. In PT 1.10 torch.distributed is replaced by torchrun, and a torchX tool is being created to...simplify things!).
Not having to manage that mess is a reason why SageMaker Distributed Data Parallel is a great value prop: you only need to edit your script, and the SM service handles process creation. Unfortunately, as you point out, SMDP being limited to P3 and P4 training jobs seriously limits its use.
Below are important PT DDP concepts to understand to alter single-GPU code into multi-machine code
Unlike Apache Spark, which takes care of workload partitioning on your behalf, Pytorch distributed training requires the user to assign specific pieces of work to specific GPUs. In the following section, we assume that we train on GPU.
In PyTorch DDP, each GPU runs a customized copy of you training code. A copy of the training code running on one GPU is generally called a rank, a data parallel replica, a process, a worker, but other names may exist.
For PyTorch DDP to launch a training cluster on the MxN GPUs spread over your M machines, you must specify to PyTorch DDP the number of machines you have and the number of processes to launch per machine. This is respectively done by the parameters -nnodes and -nproc_per_node of the torch.distributed.launch utility. You must run the torch.distributed.lauch once on each node of the training cluster. You can achieve this parallel command with multiple tools, for example with MPI or SageMaker Training as mentioned above. In order to establish the necessary handshakes and form a cluster, you must also specify in the torch.distributed.launch command -node_rank, which must take a unique machine ID between 0 and N-1 on each of the machines, and -master_addr and -master_port, optional if you run a single-machine cluster, which must be the same across all machines.
In the init_process_group DDP initialization method running from within each data parallel replica script, you must specify the world size and replica ID, respectively with the world_size and rank parameters. Hence you must have a way to communicate to each script a unique ID, generally called the global rank. The global rank can help you personalize the work done by each GPU, for example saving a model just from one card, or running validation only in one card. In a cluster composed of 3 machines having 4 GPUs each, global ranks would range from 0 to 11. Within a machine, in order to assign DDP data parallel replicas to available GPUs, the script running in each replica must be assigned a GPU ID, unique within the machine it's running on. This is called the local rank and can be set as an argument by the PyTorch DDP torch.distributed.launch. In a cluster composed of 3 machines having 4 GPUs each, on each machine the DDP processes would have local ranks ranging from 0 to 3

Results of Kernal PCA and LLE are different for different number of CPU cores provided for the run

I am doing dimensionality reduction using Scikit-Learns's KPCA and sometimes LLE APIs.
I have dataset which has a shape of around (700X150) all numerical.
I am just trying to pass this data to one of the above mentioned APIs to reduce its features, I have written a simple python script(say run.py) for it which I can run from terminal, that also saves the data after reduction.
What issue I am facing is, I am using "taskset" command in linux terminal to assign certain number of CPUs for a particular run. I can give any number of CPUs out of how much I have on my machine, for example, the terminal command could be:
taskset -c 1-3 python run.py when I want to give 3 cores
or taskset -c 1-2 python run.py when I want to use just 2 cores.
or simply just python run.py when I do not want to specify any CPU.
The problem is I am getting different results in all the three cases, by different results i mean output data of there three runs are different from one another, which should not happen since I using the script, same input data, and same algorithm(either KPCA or LLE) for all the three runs, I have also kept 'n_jobs' parameter to 2 because I am at least using 2 CPUs when I am using taskset. I have also supplied a random_state. All these 3 results are totally reproducible fortunately, that means the 1st command(with 3 cores) will produce same output data on every run, similarly 2nd and 3rd command also produces same results in each of their respective runs if run multiple times.
But the question why are these output different from each other ?
Setting up the taskset in my run is important for me because I am using a multi-core machine and I need to schedule different CPUs for different tasks, sometimes I have 2, sometimes I have 3, sometimes n number of CPUs for the same task which I give them accordingly but I don't want the results to be different based on how many CPUs I gave, this is affecting my classification performance as well which is later in the pipeline.
Also, done some experiments , I don't see this behavior when I use Isomap for reducing my data. The results are same doesn't matter how many CPUs I give.
I also used "numactl" command in place of "taskset" but the behavior was same.
Surprisingly, we could also see this same behaviour when using kpca function in R language! When I use R do to the same thing. Is there anything common and fundamental here regarding KPCA that I am missing ?
Please help.
Thanks,
Pranay
There might be something interesting in understanding exactly how the results differ. Algorithms like LLE, PCA and k-PCA that have a matrix factorization that has a sign ambiguity (e.g. in PCA, you can negate the component vectors and negate the coefficients and have the "same" answer). I'm not exactly what approach is being used for that matrix factorization, and what role randomization plays in that, and how it varies when it is parallelized, but it doesn't surprise me that it might be different when the computation is split across more processors, even with the same random seed.
TL;DR: If the results are different just in that some coordinates are negated, that isn't surprising. If they are more different than that, then I don't have a good answer.

Multiple python calls from bash but no speed-up

I want to run a Python3 process multiple times with different hyperparameters. To fully utilize the available CPU's, I want to spawn the process multiple times. However, I hardly observe any speed-up in practice. Below I will reproduce a small test that illustrates the effect.
First a Python test script:
(speed_test.py)
import numpy as np
import time
now = time.time()
for i in range(50):
np.matmul(np.random.rand(1000,1000),np.random.rand(1000,1000))
print(round(time.time()-now,1))
A single call: python3 speed_test.py prints 10.0 seconds.
However, when I try to run 2 processes in parallel:
python3 speed_test.py & python3 speed_test.py & wait prints 18.6 18.9.
parallel python3 speed_test.py ::: {1..2} prints 18.3 18.7.
It seems as if parallelization hardly buys me anything here (two executions in almost twice the time). I know I can't expect a linear speed-up, but this seems to be very little difference. My system has 1 socket with 2 cores per socket and 2 threads per core (4 CPUs in total). I see the same effect on a 8 CPU Google Cloud instance. Roughly, the computational time improves no more than ~10-20% per process, when running in parallel.
Finally, pinning CPUs to processes does not help much either:
taskset -c 0-1 python3 speed_test.py & taskset -c 2-3 python3 speed_test.py & wait prints 17.1 17.8
I thought each Python process could only utilize 1 CPU due to the Global Interpreter Lock. Is there anyway to speed-up my code?
Thanks for the reply #TomFenech, I should have added the CPU usage information indeed:
Local (4 vCPU): Single call = ~390%, double call ~190-200% each
Google cluster (8 vCPUs): single call ~400%, double call ~400% each (as expected)
Conclusion of toy-example: You are right. When I call htop, I actually see 4 processes per started job, not 1. So the job is internally distributing itself. I think this is related, distributing happens for (matrix) multiplication by BLAS/MKL.
Continuation for true job: So, the above toy-example was actually more involved and not a perfect case for my true script. My true (machine learning) script only partially relies on Numpy (not for matrix multiplication), but most heavy computation is performed in PyTorch. When I call my script locally (4 vCPU), it uses ~220% CPU. When I call that script on the Google Cloud cluster (8 vCPU), it - suprisingly - gets even up to ~700% (htop indeed shows 7-8 processes). So PyTorch seems to be doing an even better job at distributing itself.
(The Numpy BLAS version can be retrieved with np.__config__.show(). My local Numpy uses OpenBlas, the Google cluster uses MKL (Conda installation). I can't find a similar command to check for the BLAS version of PyTorch, but assume it uses the same.)
In general, the conclusion seems that both Numpy and PyTorch itself already take care of distributing code when it comes to matrix multiplication (and all CPUs are locally visible, i.e. no cluster/server setting). Therefore, if most of your script is matrix multiplication, then there is less reason than (at least I) expected to distribute scripts yourself.
However, not all of my code is matrix multiplication. Therefore, in theory I should still be able to get a speed-up from parallel processes. I wrote a new test, with 50/50 linear and matrix multiplication code:
(speed_test2.py)
import time
import torch
import random
now = time.time()
for i in range(12000):
[random.random() for k in range(10000)]
print('Linear time',round(time.time()-now,1))
now = time.time()
for j in range(350):
torch.matmul(torch.rand(1000,1000),torch.rand(1000,1000))
print('Matrix time',round(time.time()-now,1))
Running this on Google Cloud (8 vCPU):
Single process gives Linear time 12.6, Matrix time 9.2. (CPU during first part 100%, second part 500%)
Parallel process python3 speed_test2.py & python3 speed_test2.py gives Linear time 12.6, Matrix time 15.4 for both processes.
Adding a third process gives Linear time ~12.7, Matrix time 25.2
Conclusion: Although there are 8 vCPU here, the Pytorch/matrix (second) part of the code actually gets slower with more than 2 processes. The linear part of the code does of course increase (up to 8 parallel processes). I think this altogether explains why in practice, Numpy/PyTorch code may not show that much improvement when you start multiple concurrent processes. And that it may not always be beneficial to naively start 8 processes when you see 8 vCPUs. Please correct me if I am wrong somewhere here.

How can a large number of assignments to the same array cause a pyopencl.LogicError when run on GPU?

I'm using pyOpenCL to do some complex calculations.
It runs fine on CPU, but I get an error when trying to run it on an NVIDIA GeForce 9400M (256 MB).
I'm working on Mac OS X Lion (10.7.5)
The strange thing is that this error does not always show up. It seems to occur when my calculations use larger numbers (resulting in larger iterations) but only when run on GPU.
I'm not writing to memory locations I'm not supposed to write to. I ruled out possible problems with concurrent modification by running the code as a single work item.
I simplified my OpenCL code as much as possible, and from what was left created some very simple code with extremely weird behavior that causes the pyopencl.LogicError. It consists of 2 nested loops in which a couple of assignments are made to the result array. This assignment need not even depend on the state of the loop.
This is run on a single thread (or work item, shape = (1,)) on the GPU.
__kernel void weirdError(__global unsigned int* result){
unsigned int outer = (1<<30)-1;
for(int i=20; i--; ){
unsigned int inner = 0;
while(inner != outer){
result[0] = 1248;
result[1] = 1337;
inner++;
}
outer++;
}
}
The strange part is that removing either one of the assignments to the result array removes the error. Also, decreasing the initial value for outer (down to (1<<20)-1 for example) also removes the error. In these cases, the code returns normally, with the correct result available in the corresponding buffer.
On CPU, it never raises an error.
The OpenCL code is run from Python using PyOpenCL.
Nothing fancy in the setup:
platform = cl.get_platforms()[0]
device = platform.get_devices(cl.device_type.GPU)[0]
context = cl.Context([device])
program = cl.Program(context, getProgramCode()).build()
queue = cl.CommandQueue(context)
In this Python code I set the result_buf to 0, then I run the calculation in OpenCL that will set its values in a large iteration. Afterwards I try to collect this value from the device memory, but that's where it goes wrong:
result = numpy.zeros(2, numpy.uint32)
result_buf = cl.Buffer(context, mem_flags.READ_WRITE | mem_flags.COPY_HOST_PTR, hostbuf=result)
shape = (1,)
program.weirdError(queue, shape, None, result_buf)
cl.enqueue_copy(queue, result, result_buf)
The last line gives me:
pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
How can this repeated assignment cause an error?
And more importantly: how can it be avoided?
I understand that this problem is probably platform dependent, and thus perhaps hard to reproduce. But this is the only machine I have access to, so the code should work on this machine.
DISCLAIMER: I have never worked with OpenCL (or CUDA) before. I wrote the code on a machine where the GPU did not support OpenCL. I always tested it on CPU. Now that I switched to GPU, I find it frustrating that errors do not occur consistently and I have no idea why.
My advice is to avoid such a long loops inside a kernel. Work Item is making over 1 billion of iterations, and that's a long shot. Probably, driver kills your kernel as it takes too much time to execute. Reduce the number of iterations to the maximal amount, which doesn't lead to error and look at the execution time. If it takes something like seconds - that's too much.
As you said, reducing iterations numbers solves the problem and that's the evidence in my opinion. Reducing the number of assignment operations also makes kernel runs faster as IO operations are usually the slowest.
CPU doesn't face such difficulties for obvious reasons.
This timeout problem can be fixed in Windows and Linux, but apparently not in Mac.
Windows
This answer to a similar question (explaining the symptoms in Windows) tells both what is going on and how to fix it:
This is a known "feature" under Windows (not sure about Linux) - if the video driver stops responding, the OS will reset it. Except that, since OpenCL (and CUDA) is implemented by the driver, a kernel that takes too long will look like a frozen driver. There is a watchdog timer that keeps track of this (5 seconds, I believe).
Your options are:
You need to make sure that your kernels are not too time-consuming
(best).
You can turn-off the watchdog timer: Timeout Detection and Recovery of GPUs.
You can run the kernel on a GPU that is not hooked up to a display.
I suggest you go with 1.
This answer explains how to actually do (2) in Windows 7. But the MSDN-page for these registry keys mentions they should not be manipulated by any applications outside targeted testing or debugging. So it might not be the best option, but it is an option.
Linux
(From Cuda Release Notes, but also applicable to OpenCL)
GPUs without a display attached are not subject to the 5 second run time restriction. For this reason it is recommeded that CUDA is run on a GPU that is NOT attached to an X display.
While X does not need to be running in order to use CUDA, X must have been initialized at least once after booting in order to properly load the NVIDIA kernel module. The NVIDIA kernel module remains loaded even after X shuts down, allowing CUDA to continue to function.
Mac
Apple apparently does not allow fiddling with this watchdog and thus the only option seems to be using a second GPU (without a screen attached to it)

CUDA/PyCUDA: Which GPU is running X11?

In a Linux system with multiple GPUs, how can you determine which GPU is running X11 and which is completely free to run CUDA kernels? In a system that has a low powered GPU to run X11 and a higher powered GPU to run kernels, this can be determined with some heuristics to use the faster card. But on a system with two equal cards, this method cannot be used. Is there a CUDA and/or X11 API to determine this?
UPDATE: The command 'nvidia-smi -a' shows a whether a "display" is connected or not. I have yet to determine if this means physically connected, logically connected (running X11), or both. Running strace on this command shows lots of ioctls being invoked and no calls to X11, so assuming that the card is reporting that a display is physically connected.
There is a device property kernelExecTimeoutEnabled in the cudaDeviceProp structure which will indicate whether the device is subject to a display watchdog timer. That is the best indicator of whether a given CUDA device is running X11 (or the windows/Mac OS equivalent).
In PyCUDA you can query the device status like this:
In [1]: from pycuda import driver as drv
In [2]: drv.init()
In [3]: print drv.Device(0).get_attribute(drv.device_attribute.KERNEL_EXEC_TIMEOUT)
1
In [4]: print drv.Device(1).get_attribute(drv.device_attribute.KERNEL_EXEC_TIMEOUT)
0
Here device 0 has a display attached, and device 1 is a dedicated compute device.
I don't know any library function which could check that. However a one "hack" comes in mind:
X11, or any other system component that manages a connected monitor must consume some of the GPU memory.
So, check if both devices report the same amount of available global memory through 'cudaGetDeviceProperties' and then check the value of 'totalGlobalMem' field.
If it is the same, try allocating that (or only slightly lower) amount of memory on each of the GPU and see which one fails to do that (cudaMalloc returning an error flag).
Some time ago I read somewhere (I don't remember where) that when you increase your monitor resolution, while there is an active CUDA context on the GPU, the context may get invalidated. That hints that the above suggestion might work. Note however that I never actually tried it. It's just my wild guess.
If you manage to confirm that it works, or that it doesn't, let us know!

Resources