Distributed Data Parallel (DDP) Batch size - pytorch

Suppose I use 2 GPUs in a DDP setting.
If I would use a batch size of 16 when running the experiment on a single GPU,
should I pass 8 or 16 as the batch size when using 2 GPUs with DDP?
Is 16 divided into 8 and 8 automatically?
Thank you!

No, it won't be split automatically.
When you set batch_size=8 under DDP mode, each GPU receives batches of 8 samples, so the global batch size is 16.
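The relationship between the per-GPU and the global batch size can be sketched in plain Python. This is an illustration under assumptions rather than PyTorch code: it assumes one process per GPU and mimics the strided sharding that DistributedSampler applies when shuffling is off. The helper names shard_indices and batches and the dataset size are hypothetical.

```python
def shard_indices(num_samples, world_size, rank):
    """Indices seen by one rank: every world_size-th sample, starting at rank."""
    return list(range(num_samples))[rank::world_size]


def batches(indices, batch_size):
    """Split a list of indices into consecutive batches of batch_size."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]


world_size = 2          # two GPUs, one process each (assumption)
per_gpu_batch_size = 8  # the batch_size given to each process's DataLoader
num_samples = 32        # hypothetical dataset size

per_rank_batches = [
    batches(shard_indices(num_samples, world_size, rank), per_gpu_batch_size)
    for rank in range(world_size)
]

# Samples consumed across all GPUs in one optimizer step.
global_batch_size = sum(len(rank_batches[0]) for rank_batches in per_rank_batches)
print(global_batch_size)
```

Each rank sees a disjoint half of the dataset, so one step covers the per-GPU batch size multiplied by the number of processes.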

As explained here:
the application of the given module by splitting the input across the specified devices
The batch size should be larger than the number of GPUs used locally
each replica handles a portion of the input
If you use 16 as the batch size, it will be divided automatically between the two GPUs.


Slurm: can I create a sub-queue using a subset of resources in a single node?

I have a use case with slurm and I wonder if there is a way to handle it.
Constraints:
I would like to run several jobs (say 60 jobs).
Each one takes a few hours, e.g. 3h/job.
In the cluster managed by slurm, I use a queue with 2 nodes with 4 gpus each (so I can restrict my batch script to one node).
Each job takes 1 gpu.
Problem: if I put everything in the queue, I will block 4 gpus even if I specify only 1 node.
Desired solution: avoid blocking a whole machine by taking, say, 2 gpus only.
How can I put them in the queue without them taking all 4 gpus?
Could I create a kind of sub-queue that would be limited to a subset of a node's resources, for example?
You can use the Slurm consumable trackable resources plug-in (cons_tres, enabled in your slurm.conf file; more info here: https://slurm.schedmd.com/cons_res.html#using_cons_tres) to:
Specify the number of GPUs per task with --gpus-per-task=X
-or-
Request a specific number of GPUs for the job with --gpus=X
-or-
Bind tasks to specific GPUs by ID with --gpu-bind, for example --gpu-bind=map_gpu:<list>

On batch size, epochs, and learning rate of DistributedDataParallel

I have read these threads [1] [2] [3] [4], and this article.
I think I understand how batch size and epochs work with DDP, but I am not sure about the learning rate.
Let's say I have a dataset of 100 * 8 images. In a non-distributed scenario, I set the batch size to 8, so each epoch will do 100 gradient steps.
Now I am in a multi-node multi-GPU scenario, with 2 nodes of 4 GPUs each (so the world size is 8).
I understand that I need to pass batches of 8 / 8 = 1, because each update will aggregate the gradients from the 8 GPUs. In each worker, the data loader will still load 100 batches, but each with 1 sample. So the whole dataset is processed exactly once per epoch.
I checked and everything seems like that.
But what about the learning rate?
According to the official docs:
When a model is trained on M nodes with batch=N, the gradient will be
M times smaller when compared to the same model trained on a single
node with batch=M*N if the loss is summed (NOT averaged as usual)
across instances in a batch (because the gradients between different
nodes are averaged). [...] But in most cases, you can just treat a
DistributedDataParallel wrapped model, a DataParallel wrapped model
and an ordinary model on a single GPU as the same (E.g. using the same
learning rate for equivalent batch size).
I understand that the gradients are averaged, so if the loss is averaged over samples nothing changes, while if it is summed we need to account for that. But does 'nodes' refer to the total number of GPUs across all cluster nodes (the world size) or just to the cluster nodes? In my example, would M be 2 or 8? Some posts in the threads I linked say that the gradient is divided 'by the number of GPUs'. How exactly is the gradient aggregated?
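On the aggregation itself: DDP averages the gradients over all participating processes, so with one process per GPU the divisor is the world size (8 in this example), not the number of machines. A minimal pure-Python sketch of what that averaging implies, assuming one process per GPU; the one-parameter model, the toy data, and the helper names are hypothetical and only stand in for a real network:

```python
import math


def grad_mean_loss(w, samples):
    """Gradient of the mean squared error of y = w * x over the samples."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


def grad_sum_loss(w, samples):
    """Gradient of the summed squared error of y = w * x over the samples."""
    return sum(2 * (w * x - y) * x for x, y in samples)


def ddp_average(local_grads):
    """What the all-reduce amounts to: the mean over all processes."""
    return sum(local_grads) / len(local_grads)


w = 0.5
batch = [(float(i), 3.0 * i + 1.0) for i in range(1, 9)]  # 8 toy samples
world_size = 8
shards = [batch[rank::world_size] for rank in range(world_size)]  # 1 sample each

averaged_mean = ddp_average([grad_mean_loss(w, shard) for shard in shards])
averaged_sum = ddp_average([grad_sum_loss(w, shard) for shard in shards])

# Averaged loss: identical to the single-process gradient on the full batch.
mean_matches = math.isclose(averaged_mean, grad_mean_loss(w, batch))
# Summed loss: smaller by a factor of world_size.
sum_is_scaled = math.isclose(averaged_sum, grad_sum_loss(w, batch) / world_size)
print(mean_matches, sum_is_scaled)
```

With an averaged loss and equally sized shards, the distributed gradient matches the single-GPU gradient for the same global batch, which is why the same learning rate can be reused for an equivalent batch size.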
Please refer to the following discussion:
https://github.com/PyTorchLightning/pytorch-lightning/discussions/3706
"As far as I know, learning rate is scaled with the batch size so that the sample variance of the gradients is kept approx. constant.
Since DDP averages the gradients from all the devices, I think the LR should be scaled in proportion to the effective batch size, namely, batch_size * num_accumulated_batches * num_gpus * num_nodes
In this case, assuming batch_size=512, num_accumulated_batches=1, num_gpus=2 and num_nodes=1 the effective batch size is 1024, thus the LR should be scaled by sqrt(2), compared to a single GPU with effective batch size 512."
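The arithmetic in the quoted reply can be written out as follows. This is only a sketch of the square-root heuristic described in that reply, not an official rule; the function names and the base learning rate of 0.1 are hypothetical.

```python
import math


def effective_batch_size(batch_size, num_accumulated_batches, num_gpus, num_nodes):
    """Samples contributing to one optimizer step across the whole job."""
    return batch_size * num_accumulated_batches * num_gpus * num_nodes


def sqrt_scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Scale the learning rate by the square root of the batch-size ratio."""
    return base_lr * math.sqrt(new_batch_size / base_batch_size)


base_lr = 0.1  # hypothetical learning rate tuned on a single GPU
single_gpu = effective_batch_size(512, 1, 1, 1)
two_gpus = effective_batch_size(512, 1, 2, 1)

new_lr = sqrt_scaled_lr(base_lr, single_gpu, two_gpus)
print(two_gpus, new_lr)
```

Note that this applies when the per-GPU batch size is kept at 512, so the effective batch size doubles. If the per-GPU batch size is reduced to keep the effective batch size unchanged, the ratio is 1 and the learning rate stays the same.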

How is the maximum number of workers in PyTorch DataLoader decided if there are 2 CPUs in a node?

I was trying to run an autoencoder on an HPC node that had 2 CPUs with 20 cores each (so 40 cores per node). When using torch.utils.data.DataLoader, I specified num_workers as 40 instead of 20 since I thought that there were 40 CPU cores available.
However, I got the following warning:
/jet/home/wehs7661/.conda/envs/diffnets/lib/python3.6/site-packages/torch/utils/data/dataloader.py:481: UserWarning: This `DataLoader` will create 40 worker processes in total. Our suggested max number of worker in current system is 20, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
This kind of confused me since I thought that the maximum number of workers should be 40 instead of 20. I'm wondering how the maximum number of workers is decided in DataLoader and if I should change num_workers to 20 even if I actually have 40 cores.
The DataLoader class in the torch.utils.data module loads and iterates over the elements of a dataset. It takes arguments such as the following:
DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    num_workers=0,
    collate_fn=None,
    pin_memory=False,
)
If you want the data to be loaded in the main process, set num_workers=0. No worker processes are created in that case, so the warning goes away.
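As for where the suggested maximum of 20 may come from: the cpuset_checked fragment at the end of the warning hints that the limit is based on the CPU cores the process is allowed to run on, not on the number of cores installed in the node. On a cluster, the scheduler may have granted the job only 20 of the 40 cores. A minimal sketch for checking this from inside the job, using only the standard library; the function name is hypothetical and the fallback branch is an assumption for platforms without os.sched_getaffinity:

```python
import os


def usable_cpu_count():
    """Number of CPU cores the current process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: respects the affinity mask / cpuset assigned to this process.
        return len(os.sched_getaffinity(0))
    # Platforms without sched_getaffinity: fall back to the total core count.
    return os.cpu_count() or 1


allowed = usable_cpu_count()
total = os.cpu_count()
print(f"allowed cores: {allowed}, cores in the machine: {total}")
```

If the first number is 20 while the second is 40, requesting more cores from the scheduler (or lowering num_workers to 20) would be the options to consider.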

Is there a PyTorch with CUDA Unified GPU-CPU Memory fork?

Training a DNN model can be a pain when a batch of one image takes 15 GB. Speed is not so important for me, but fitting bigger batches (and models) is. So I wonder if there is a PyTorch fork with CUDA Unified Memory, or something like it, to fit giant models (with 16 GB of RAM per GPU but 250 GB on the CPU side, it seems quite reasonable)?
If you do not care about the time it takes but need large batches, you can use a slower approach. Say you need a batch of 128 samples but your GPU memory fits only 8 samples. You can create smaller batches of 8 samples and then average their gradients.
For each small batch of 8 samples that you evaluate, you keep the .grad of each parameter in your CPU memory, so you end up with a list of grads for each of your model's parameters. After you have gathered the grads for 16 batches of 8 samples (128 samples in total), you can average the gradients of each parameter and put the result back into the .grad attribute of each parameter.
You can then call .step() on your optimizer. Up to floating-point rounding, this should yield the same update as a single large batch of 128 samples, provided the loss is averaged over samples and the model has no layers whose output depends on batch statistics (such as batch normalization).
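The equivalence can be checked on a toy problem. The sketch below is plain Python rather than PyTorch, with a hypothetical one-parameter model and made-up data; the list stored_grads stands in for the gradients kept in CPU memory.

```python
import math


def grad_mean_loss(w, samples):
    """Gradient of the mean squared error of y = w * x over the samples."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


w, lr = 0.5, 0.01
data = [(i / 10, 0.3 * i + 1.0) for i in range(128)]  # 128 toy samples
micro_batch_size = 8
micro_batches = [
    data[i:i + micro_batch_size] for i in range(0, len(data), micro_batch_size)
]

# One gradient per micro-batch, gathered before any optimizer step is taken.
stored_grads = [grad_mean_loss(w, micro_batch) for micro_batch in micro_batches]
accumulated_grad = sum(stored_grads) / len(stored_grads)

w_accumulated = w - lr * accumulated_grad          # SGD step from averaged grads
w_large_batch = w - lr * grad_mean_loss(w, data)   # SGD step from the full batch

same_step = math.isclose(w_accumulated, w_large_batch)
print(len(micro_batches), same_step)
```

The averaging works because all micro-batches have the same size; with unequal sizes, each stored gradient would need to be weighted by its number of samples.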

Random Forest: Running out of memory

I'm using scikit-learn's Random Forest to fit training data (~30 MB), and my laptop keeps crashing after running out of application memory. The test data is a few times bigger than the training data. I'm using a MacBook Air (2 GHz, 8 GB memory).
What are some of the ways to deal with this?
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier(n_estimators=100, n_jobs=4)
print("20 Fold CV Score: ", cross_val_score(rf, X_train_a, y_train, cv=20, scoring='roc_auc').mean())
Your best choice is to tune the arguments.
n_jobs=4
In this code n_jobs is passed to RandomForestClassifier, so up to four trees are built simultaneously (cross_val_score has its own n_jobs argument for running train-test cycles in parallel). When parallel jobs run in separate processes, the full dataset is copied into each of them, so memory use can grow roughly in proportion to n_jobs. Try reducing n_jobs to 2 or 1 to save memory.
cv=20
This splits the data into 20 pieces, and the code does 20 train-test iterations. This means that the training data in each iteration is the size of 19 pieces of the original data. You can quite safely reduce cv to 10, although your accuracy estimate might get worse. It won't save much memory, but it shortens the runtime.
n_estimators = 100
Reducing this will save little memory, but it will make the algorithm run faster as the random forest will contain fewer trees.
To sum up, I'd recommend reducing n_jobs to 2 to save memory (a 2-fold increase in runtime). To compensate for the runtime, I'd suggest changing cv to 10 (a 2-fold saving in runtime). If that does not help, change n_jobs to 1 and also reduce the number of estimators to 50 (processing becomes another two times faster).
I was dealing with a ~4 MB dataset, and a Random Forest from scikit-learn with default hyper-parameters took ~50 MB (more than 10 times the size of the data). Setting max_depth = 6 decreased the memory consumption 66-fold. The performance of the shallow Random Forest on my dataset even improved!
I wrote up this experiment in a blog post.
From my experience, in regression tasks the memory usage can grow even more, so it is important to control the tree depth. The tree depth can be controlled directly with max_depth or by tuning min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features, and max_leaf_nodes.
The memory used by the Random Forest can of course also be controlled with the number of trees in the ensemble.
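Why max_depth has such a strong effect can be seen from a simple upper bound: a binary tree whose root is at depth 0 has at most 2^(max_depth + 1) - 1 nodes. The sketch below only evaluates that bound; it does not measure scikit-learn's actual memory use, and the function names are hypothetical.

```python
def max_nodes_per_tree(max_depth):
    """Upper bound on the nodes of a binary tree whose root is at depth 0."""
    return 2 ** (max_depth + 1) - 1


def max_nodes_in_forest(n_estimators, max_depth):
    """Upper bound on the total node count of a forest of binary trees."""
    return n_estimators * max_nodes_per_tree(max_depth)


print(max_nodes_per_tree(6))        # at most 127 nodes per tree
print(max_nodes_in_forest(100, 6))  # at most 12700 nodes in 100 trees
```

Without a depth limit, trees keep splitting until the other stopping criteria are met, so the node count is bounded by the training set size rather than by a small constant.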
