Use PyTorch DistributedDataParallel with Hugging Face on Amazon SageMaker - pytorch

Even for single-instance training, PyTorch DistributedDataParallel (DDP) is generally recommended over PyTorch DataParallel (DP) because DP's strategy is less performant and it uses more memory on the default device. (Per this PyTorch forums thread)
Hugging Face recommends running distributed training via the python -m torch.distributed.launch launcher, because their Trainer API supports DDP but will fall back to DP if you don't use the launcher. (Per this HF forums thread)
I recently ran into this problem: scaling a HF training job from p3.8xlarge to p3.16xlarge increased memory consumption on (I think) one of the GPUs to the point where I had to significantly reduce batch size to avoid CUDA Out of Memory errors - basically losing all the scaling advantage.
So the good news is that for p3.16xl and above I can just enable SageMaker Distributed Data Parallel and the PyTorch DLC will automatically launch via torch.distributed for me.
The bad news, for smaller workloads or for anyone wanting to test before scaling up, is that SMDistributed doesn't support all multi-GPU instance types - no p3.8xl or g-series, for example. I did try manually setting the sagemaker_distributed_dataparallel_enabled environment variable, but no joy.
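For reference, on an instance type that SMDDP does support, enabling it from the SageMaker Python SDK looks roughly like the sketch below. The entry_point script name and the framework versions here are placeholders, not taken from the original job:

import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# Sketch: enable SageMaker Distributed Data Parallel on a supported instance type.
huggingface_estimator = HuggingFace(
    entry_point="train.py",              # hypothetical HF Trainer script
    instance_type="ml.p3.16xlarge",      # an SMDDP-supported instance
    instance_count=1,
    role=role,
    transformers_version="4.6.1",        # example version combination
    pytorch_version="1.7.1",
    py_version="py36",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
huggingface_estimator.fit()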
So how else can we launch HF Trainer scripts with PyTorch DDP on SageMaker?

Great question, thanks for asking! PyTorch DDP runs data parallel workers in multiple processes, which must be launched and managed by developers. DDP should be seen as a managed allreduce rather than a managed data-parallelism library, since it requires you to launch and manage the workers and even assign resources to workers. To launch the DDP processes in a SageMaker Training job you have several options:
If you do multi-GPU, single-machine, you can use torch.multiprocessing.spawn, as shown in this official PyTorch demo (which is broken, by the way)
If you do multi-GPU, single-machine, you can also use the Ray Train library to launch those processes. I was able to use it in a notebook, but not in the DLC yet (it is a recent library that is still a bit rough to learn and make work; see all my issues here). Ray Train should work on multi-node too.
If you do multi-GPU, any-machine, you can use torch.distributed.launch, wrapped in a launcher script in shell or Python. Example here https://gitlab.aws.dev/cruchant/a2d2-segmentation/-/blob/main/3_2D-Seg-Audi-A2D2-Distributed-Training-DDP.ipynb
You can also launch those processes with the SageMaker MPI integration instead of torch.distributed. Unfortunately, we didn't create documentation for this, so nobody uses it or pitches it. But it looks cool, because it allows you to run copies of your script directly on the EC2 machines without needing to invoke an intermediary PyTorch launcher. Example here
So for now, my recommendation would be to go with route (3), which is closest to what the PyTorch community does and so provides an easier development and debugging path; a minimal launcher sketch follows below.
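As an illustration of route (3), here is a minimal Python launcher sketch that wraps torch.distributed.launch, assuming it runs inside a SageMaker training container where the SM_HOSTS, SM_CURRENT_HOST and SM_NUM_GPUS environment variables are set; train.py and the master port are placeholders:

import json
import os
import subprocess
import sys

# Cluster topology, as exposed by the SageMaker Training environment.
hosts = json.loads(os.environ["SM_HOSTS"])        # e.g. ["algo-1", "algo-2"]
current_host = os.environ["SM_CURRENT_HOST"]      # e.g. "algo-1"
num_gpus = int(os.environ["SM_NUM_GPUS"])         # GPUs on this machine

# One torch.distributed.launch invocation per node; SageMaker runs this
# launcher once on every machine in the job.
cmd = [
    sys.executable, "-m", "torch.distributed.launch",
    "--nnodes", str(len(hosts)),
    "--node_rank", str(hosts.index(current_host)),
    "--nproc_per_node", str(num_gpus),
    "--master_addr", hosts[0],
    "--master_port", "29500",
    "train.py",                                   # hypothetical HF Trainer script
]
subprocess.check_call(cmd)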
Notes:
PyTorch DDP evolves fast. In PyTorch 1.10, torch.distributed.launch is superseded by torchrun, and a TorchX tool is being created to... simplify things!
Not having to manage that mess is a reason why SageMaker Distributed Data Parallel is a great value prop: you only need to edit your script, and the SM service handles process creation. Unfortunately, as you point out, SMDP being limited to P3 and P4 training jobs seriously limits its use.
Below are the important PyTorch DDP concepts to understand in order to turn single-GPU code into multi-machine code.
Unlike Apache Spark, which takes care of workload partitioning on your behalf, PyTorch distributed training requires the user to assign specific pieces of work to specific GPUs. In the following section, we assume that we train on GPUs.
In PyTorch DDP, each GPU runs a customized copy of your training code. A copy of the training code running on one GPU is generally called a rank, a data parallel replica, a process, or a worker, but other names may exist.
For PyTorch DDP to launch a training cluster on the MxN GPUs spread over your M machines, you must specify to PyTorch DDP the number of machines you have and the number of processes to launch per machine. This is respectively done by the parameters --nnodes and --nproc_per_node of the torch.distributed.launch utility. You must run torch.distributed.launch once on each node of the training cluster. You can achieve this parallel command with multiple tools, for example with MPI or SageMaker Training as mentioned above. In order to establish the necessary handshakes and form a cluster, you must also specify in the torch.distributed.launch command --node_rank, which must take a unique machine ID between 0 and M-1 on each of the machines, and --master_addr and --master_port, optional if you run a single-machine cluster, which must be the same across all machines.
In the init_process_group DDP initialization method running from within each data parallel replica script, you must specify the world size and replica ID, respectively with the world_size and rank parameters. Hence you must have a way to communicate to each script a unique ID, generally called the global rank. The global rank can help you personalize the work done by each GPU, for example saving a model from just one rank, or running validation only on one rank. In a cluster composed of 3 machines with 4 GPUs each, global ranks would range from 0 to 11. Within a machine, in order to assign DDP data parallel replicas to available GPUs, the script running in each replica must be assigned a GPU ID, unique within the machine it is running on. This is called the local rank and can be set as an argument by the PyTorch DDP torch.distributed.launch utility. In a cluster composed of 3 machines with 4 GPUs each, the DDP processes on each machine would have local ranks ranging from 0 to 3; a minimal script-side sketch follows below.
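To make these concepts concrete, here is a minimal, hedged sketch of the script-side setup, assuming the process was started by torch.distributed.launch (which passes --local_rank and sets RANK and WORLD_SIZE in the environment); the toy model and NCCL backend are just illustrative choices:

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)   # injected by torch.distributed.launch
args = parser.parse_args()

# World size and global rank are read from the environment variables
# (WORLD_SIZE, RANK) that torch.distributed.launch sets for each process.
dist.init_process_group(backend="nccl", init_method="env://")

# Pin this replica to its own GPU within the machine.
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)

model = torch.nn.Linear(10, 10).to(device)                 # toy model for illustration
model = DDP(model, device_ids=[args.local_rank])

# Personalize work by global rank, e.g. only rank 0 saves checkpoints.
if dist.get_rank() == 0:
    torch.save(model.state_dict(), "model.pt")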

Related

Results of Kernel PCA and LLE are different for different numbers of CPU cores provided for the run

I am doing dimensionality reduction using Scikit-Learn's KPCA and sometimes LLE APIs.
I have a dataset with a shape of around (700 x 150), all numerical.
I am just trying to pass this data to one of the above-mentioned APIs to reduce its features. I have written a simple Python script (say run.py) for it which I can run from the terminal, and which also saves the data after reduction.
The issue I am facing is this: I am using the "taskset" command in the Linux terminal to assign a certain number of CPUs to a particular run. I can give it any number of the CPUs available on my machine; for example, the terminal command could be:
taskset -c 1-3 python run.py when I want to give 3 cores
or taskset -c 1-2 python run.py when I want to use just 2 cores.
or simply just python run.py when I do not want to specify any CPU.
The problem is that I am getting different results in all three cases. By different results I mean that the output data of these three runs differ from one another, which should not happen since I am using the same script, the same input data, and the same algorithm (either KPCA or LLE) for all three runs. I have also kept the 'n_jobs' parameter at 2, because I am using at least 2 CPUs whenever I use taskset, and I have supplied a random_state. Fortunately, all three results are totally reproducible: the 1st command (with 3 cores) produces the same output data on every run, and the 2nd and 3rd commands likewise produce the same results each time they are run.
But the question is: why are these outputs different from each other?
Setting taskset in my run is important for me because I am using a multi-core machine and I need to schedule different CPUs for different tasks; sometimes I have 2, sometimes 3, sometimes n CPUs for the same task, which I assign accordingly. But I don't want the results to differ based on how many CPUs I gave, since this also affects my classification performance later in the pipeline.
Also, having done some experiments, I don't see this behavior when I use Isomap to reduce my data. The results are the same no matter how many CPUs I give.
I also used "numactl" command in place of "taskset" but the behavior was same.
Surprisingly, I could also see this same behaviour when using the kpca function in R to do the same thing! Is there anything common and fundamental here regarding KPCA that I am missing?
Please help.
Thanks,
Pranay
There might be something interesting in understanding exactly how the results differ. Algorithms like LLE, PCA and k-PCA rely on a matrix factorization that has a sign ambiguity (e.g. in PCA, you can negate the component vectors and negate the coefficients and have the "same" answer). I'm not exactly sure what approach is being used for that matrix factorization, what role randomization plays in it, or how it varies when the computation is parallelized, but it doesn't surprise me that it might be different when the work is split across more processors, even with the same random seed.
TL;DR: If the results are different just in that some coordinates are negated, that isn't surprising. If they are more different than that, then I don't have a good answer.
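As a quick way to test the sign-ambiguity hypothesis, one could compare the two embeddings column by column, up to a sign flip. A rough sketch, where the file names are placeholders for whatever run.py saves:

import numpy as np

# Hypothetical outputs saved by two runs of run.py with different core counts.
a = np.load("embedding_3cores.npy")
b = np.load("embedding_2cores.npy")

# For each component, check whether the columns match directly or after negation.
for j in range(a.shape[1]):
    same = np.allclose(a[:, j], b[:, j], atol=1e-6)
    flipped = np.allclose(a[:, j], -b[:, j], atol=1e-6)
    print(f"component {j}: identical={same}, sign-flipped={flipped}")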

If I Trace a PyTorch Network on Cuda, can I use it on CPU?

I traced my neural network using torch.jit.trace on a CUDA-compatible GPU server, and when I reloaded that trace on the same server I could use it fine. Now, when I download it onto my laptop (for quick testing) and try to load the trace, I get:
RuntimeError: Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend. 'aten::empty_strided' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].
Can I not switch between GPU and CPU on a trace? Or is there something else going on?
I had this exact same issue. In my model I had one line of code that was causing this:
if torch.cuda.is_available():
    weight = weight.cuda()
If you have a look at the official documentation for trace (https://pytorch.org/docs/stable/generated/torch.jit.trace.html) you will see that
the returned ScriptModule will always run the same traced graph on any input. This has some important implications when your module is expected to run different sets of operations, depending on the input and/or the module state
So, if the model was traced on a machine with a GPU, this operation will be recorded and you won't even be able to load your model on a CPU-only machine. To solve this, delete everything that makes your model CUDA-dependent. In my case it was as easy as deleting the code block above.
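One hedged way to avoid the problem altogether is to move the model to CPU before tracing, so that no device-specific ops get baked into the graph; a minimal sketch with a toy model:

import torch
import torch.nn as nn

# Toy model for illustration only.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

# Trace on CPU so the recorded graph contains no CUDA-specific operations.
model = model.eval().cpu()
example_input = torch.randn(1, 8)
traced = torch.jit.trace(model, example_input)
traced.save("model_cpu.pt")

# The trace can then be loaded on a CPU-only laptop...
loaded = torch.jit.load("model_cpu.pt", map_location="cpu")
# ...or remapped back onto a GPU on a machine that has one:
# loaded = torch.jit.load("model_cpu.pt", map_location="cuda:0")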

Pytorch - Distributed Data Parallel Confusion

I was just looking at the DDP Tutorial:
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
According to this:
It’s common to use torch.save and torch.load to checkpoint modules during training and recover from checkpoints. See SAVING AND LOADING MODELS for more details. When using DDP, one optimization is to save the model in only one process and then load it to all processes, reducing write overhead. This is correct because all processes start from the same parameters and gradients are synchronized in backward passes, and hence optimizers should keep setting parameters to the same values. If you use this optimization, make sure all processes do not start loading before the saving is finished. Besides, when loading the module, you need to provide an appropriate map_location argument to prevent a process to step into others’ devices. If map_location is missing, torch.load will first load the module to CPU and then copy each parameter to where it was saved, which would result in all processes on the same machine using the same set of devices. For more advanced failure recovery and elasticity support, please refer to TorchElastic.
I don't understand what this means. Shouldn't only one process/the first GPU be saving the model? Is saving and loading how weights are shared across the processes/GPUs?
When you're using DistributedDataParallel you have the same model across multiple devices, which are being synchronised to have the exact same parameters.
When using DDP, one optimization is to save the model in only one process and then load it to all processes, reducing write overhead.
Since they are identical, it is unnecessary to save the models from all processes, as it would just write the same parameters multiple times. For example when you have 4 processes/GPUs you would write the same file 4 times instead of once. That can be avoided by only saving it from the main process.
That is an optimisation for the saving of the model. If you load the model right after you saved it, you need to be more careful.
If you use this optimization, make sure all processes do not start loading before the saving is finished.
If you save it in only one process, that process will take time to write the file. In the meantime all other processes continue, and they might load the file before it has been fully written to disk, which may lead to all sorts of unexpected behaviour or failure: the file may not exist yet, you may read an incomplete file, or you may load an older version of the model (if you overwrite the same file).
Besides, when loading the module, you need to provide an appropriate map_location argument to prevent a process to step into others’ devices. If map_location is missing, torch.load will first load the module to CPU and then copy each parameter to where it was saved, which would result in all processes on the same machine using the same set of devices.
When saving the parameters (or any tensor for that matter) PyTorch includes the device where it was stored. Let's say you save it from the process that used GPU 0 (device = "cuda:0"), that information is saved and when you load it, the parameters are automatically put onto that device. But if you load it in the process that uses GPU 1 (device = "cuda:1"), you will incorrectly load them into "cuda:0". Now instead of using multiple GPUs, you have the same model multiple times in a single GPU. Most likely, you will run out of memory, but even if you don't, you won't be utilising the other GPUs anymore.
To avoid that problem, you should set the appropriate device for map_location of torch.load.
torch.load(PATH, map_location="cuda:1")
# Or load it on the CPU and later use .to(device) on the model
torch.load(PATH, map_location="cpu")

How to run tensorflow in single process mode?

I am trying to run tensorflow 1.13.1 with Python 2.7 on SLF 6 without GPU support. When I start my model, tensorflow appears to be spawning multiple subprocesses and running my model in parallel, trying to load every core in the system. While in most cases this is what one would probably want, this is not my case. I would like to run my model on single core only.
I have tried setting these variables:
export OMP_NUM_THREADS=1
export KMP_BLOCKTIME=0
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0
in different combinations, but was not able to achieve single-core running.
Is there a way to run Tensorflow in "dumb" single-process mode?
There are two configurable options regarding parallelism, inter_op_parallelism_threads and intra_op_parallelism_threads, in the tf.ConfigProto protocol buffer. To use a single process, I think you can try:
import tensorflow as tf

config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1,
                        allow_soft_placement=True)
There are other possible forms of parallelism; see mrry's answer in this thread.
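For completeness, a hedged sketch of how that config might be applied in TF 1.x, also capping the visible devices to a single CPU via the device_count field; the toy graph and session usage are just one way to wire it in:

import tensorflow as tf

config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1,
                        allow_soft_placement=True,
                        device_count={"CPU": 1})

# Toy graph to show the config being used by a session.
a = tf.constant([1.0, 2.0])
b = tf.constant([3.0, 4.0])

with tf.Session(config=config) as sess:
    print(sess.run(a + b))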

How to use Rmpi in R on linux Cluster to increase cores available with DEoptim?

I am using code developed in R to calibrate a hydrological model with 8 parameters using DEoptim (a function that aims to minimise an objective function). The DEoptim code uses the 'parallel' package to detect the number of cores available using detectCores(). On my PC I have 4 cores with 2 threads each, so it detects 8 cores and then sends the hydrological model out to a core with different parameter values, and the results are returned to the central process. It does this hundreds or thousands of times and iterates the parameters to try and find an optimum set. Therefore the more cores available, the faster it will work.
I am at a university and have access to a Linux compute cluster. They have servers with up to 12 cores (i.e. not threads), and if I used one of these it would work two to three times faster than my PC. Great. However, ideally I would spread the code across other servers so I could have access to more cores, with all the info sent back to the master.
Therefore, my question is how could I include Rmpi in my code to effectively increase the cores available. As you can probably tell, I am quite new to using clusters.
Many thanks, Antony
If you want to execute DEoptim on multiple nodes of a Linux cluster, I believe you'll need to use foreach by specifying parallelType=2 in the control argument. You can use either the doMPI parallel backend or the doParallel backend with an MPI cluster object. For example:
library(doParallel)
library(Rmpi)
cl <- makeCluster(mpi.universe.size()-1, type='MPI')
registerDoParallel(cl)
# and eventually...
DEoptim(fn=Genrose, lower=rep(-25, n), upper=rep(25, n),
        control=list(NP=10*n, itermax=maxIt, parallelType=2))
You'll need to have the snow package installed in addition to the others. Also, make sure that you execute your script with mpirun using the -np 1 option. If you don't use mpirun, the workers will all be spawned on the local machine.

Resources