I have a machine with 2 GPUs.
Quite often, one is used in production (i.e., doing predictions with the already trained model), while the other is used for training and experimenting with new models.
While I was using Theano, I had no problem running my scripts on only one GPU by specifying a flag as follows:
THEANO_FLAGS="device=cuda0" python training_script.py
THEANO_FLAGS="device=cuda1" python prediction_script.py
Is there a simple way to do the same in Keras with a TensorFlow backend? The default behavior seems to be to map all the memory of all the GPUs for one session.
(Please note that I don't really care if each script maps a whole GPU separately, even if they could work using less memory)
You can easily choose one GPU. Just set CUDA_VISIBLE_DEVICES to 0 or 1 before TensorFlow is imported:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="1"
Furthermore, if you want to specify what portion of the selected GPU's memory to use, add:
from keras import backend as K
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4 #what portion of gpu to use
session = tf.Session(config=config)
K.set_session(session)
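The device selection in the first snippet only takes effect if the variable is set before TensorFlow (or any other CUDA library) initializes the GPU, so it is safest to set it ahead of the framework import. A minimal sketch, assuming a hypothetical helper name select_gpu; the CUDA_DEVICE_ORDER=PCI_BUS_ID setting is an optional extra that is not part of the answer above and is meant to make the indices match the order shown by nvidia-smi:

```python
import os


def select_gpu(index):
    """Expose a single physical GPU to this process (hypothetical helper).

    Call this before importing TensorFlow/Keras: the CUDA runtime reads
    these variables when it is first initialized in the process.
    """
    # Optional assumption: number devices by PCI bus ID, as nvidia-smi does.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


select_gpu(1)
# Import the framework only after this point, for example:
# from keras import backend as K
```

Running the training script with select_gpu(0) and the prediction script with select_gpu(1) would then mirror the two THEANO_FLAGS commands from the question.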
Related
There are 3 GPUs in my system.
I want to run on the last one i.e. 2. For this reason, I set gpu_id as 2 in my configuration file as well as CUDA_VISIBLE_DEVICES=2. But in my program, the following line always assigns the 0th GPU.
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
How to fix this issue?
When you set CUDA_VISIBLE_DEVICES=2, you tell the CUDA runtime to expose only the third GPU to your process. That is, as far as PyTorch is concerned, there is only one GPU. Therefore torch.cuda.device_count() returns 1 (and not 3).
The index of this GPU, in your process, will be 0, since no other GPUs are available to the process. But as far as the OS is concerned, all processing is done on the third GPU, which was allocated to the job.
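The renumbering described above can be illustrated without any GPU. This is only a sketch that mimics the mapping for a comma-separated list of integer indices; it does not query CUDA, and visible_device_map is a hypothetical name:

```python
def visible_device_map(cuda_visible_devices):
    """Map in-process device indices to physical GPU indices.

    Illustration only: devices listed in CUDA_VISIBLE_DEVICES are
    renumbered from 0 inside the process, in the order they are listed.
    """
    physical = [int(item) for item in cuda_visible_devices.split(",") if item.strip()]
    return {logical: phys for logical, phys in enumerate(physical)}


print(visible_device_map("2"))    # {0: 2}
print(visible_device_map("2,0"))  # {0: 2, 1: 0}
```

With CUDA_VISIBLE_DEVICES=2, device 0 inside the process is therefore physical GPU 2, so torch.cuda.set_device(0) already selects the intended card.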
Even for single-instance training, PyTorch DistributedDataParallel (DDP) is generally recommended over PyTorch DataParallel (DP) because DP's strategy is less performant and it uses more memory on the default device. (Per this PyTorch forums thread)
Hugging Face recommends running distributed training via the python -m torch.distributed.launch launcher, because their Trainer API supports DDP but will fall back to DP if you don't. (Per this HF forums thread)
I recently ran into this problem: scaling a HF training job from p3.8xlarge to p3.16xlarge increased memory consumption on (I think) one of the GPUs to the point where I had to significantly reduce batch size to avoid CUDA Out of Memory errors - basically losing all scaling advantage.
So the good news is that for p3.16xl+ I can just enable SageMaker Distributed Data Parallel and the PyTorch DLC will automatically launch via torch.distributed for me.
The bad news for use cases with smaller workloads or wanting to test before they scale up, is that SMDistributed doesn't support all multi-GPU instance types. No p3.8xl or g series, for example. I did try manually setting the sagemaker_distributed_dataparallel_enabled environment variable, but no joy.
So how else can we launch HF Trainer scripts with PyTorch DDP on SageMaker?
Great question, thanks for asking! PyTorch DDP runs data parallel workers in multiple processes that must be launched and managed by developers. DDP should be seen as a managed allreduce more than a managed data-parallelism library, since it requires you to launch and manage the workers and even assign resources to them. To launch the DDP processes in a SageMaker Training job, you have several options:
1. If you do multi-GPU, single-machine, you can use torch.multiprocessing.spawn, as shown in this official PyTorch demo (that is broken by the way).
2. If you do multi-GPU, single-machine, you can also use the Ray Train library to launch those processes. I was able to use it in a Notebook, but not in the DLC yet (it is a recent library that is a bit rough to learn and make work, see all my issues here). Ray Train should work on multi-node too.
3. If you do multi-GPU, any-machine, you can use torch.distributed.launch, wrapped in a launcher script in shell or Python. Example here https://gitlab.aws.dev/cruchant/a2d2-segmentation/-/blob/main/3_2D-Seg-Audi-A2D2-Distributed-Training-DDP.ipynb
4. You can also launch those processes with the SageMaker MPI integration instead of torch.distributed. Unfortunately, we didn't create documentation for this, so no one uses it or pitches it. But it looks cool, because it allows you to run copies of your script directly on the EC2 machines without the need to invoke an intermediary PyTorch launcher. Example here
So for now, my recommendation would be to go with route (3), which is the closest to what the PyTorch community does and so provides an easier development and debugging path.
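A launcher script for the torch.distributed.launch route could look roughly like the sketch below. It is not the linked example. It assumes the SM_HOSTS, SM_CURRENT_HOST and SM_NUM_GPUS environment variables set by the SageMaker training toolkit inside the container, a hypothetical training script named train.py, and an arbitrary free port (29500). The function only builds the command, so it can be inspected before anything is executed:

```python
import json
import sys


def build_launch_command(env, script, script_args=(), master_port=29500):
    """Build the torch.distributed.launch command for the current node.

    `env` is a mapping such as os.environ. The SM_* names are assumed to
    be the variables provided by the SageMaker training toolkit.
    """
    hosts = sorted(json.loads(env["SM_HOSTS"]))      # same order on every node
    node_rank = hosts.index(env["SM_CURRENT_HOST"])  # unique ID per machine
    return [
        sys.executable, "-m", "torch.distributed.launch",
        "--nnodes", str(len(hosts)),
        "--node_rank", str(node_rank),
        "--nproc_per_node", str(env["SM_NUM_GPUS"]),  # one process per GPU
        "--master_addr", hosts[0],                    # identical on all nodes
        "--master_port", str(master_port),
        script, *script_args,
    ]


# Made-up values for a 2-node job with 4 GPUs per node, seen from the second node.
example_env = {
    "SM_HOSTS": '["algo-1", "algo-2"]',
    "SM_CURRENT_HOST": "algo-2",
    "SM_NUM_GPUS": "4",
}
command = build_launch_command(example_env, "train.py", ["--epochs", "3"])
print(" ".join(command))
```

In a real entry point, the command would be built from os.environ and passed to subprocess.run(command, check=True).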
Notes:
PyTorch DDP evolves fast. In PT 1.10, torch.distributed.launch is replaced by torchrun, and a TorchX tool is being created to... simplify things!
Not having to manage that mess is a reason why SageMaker Distributed Data Parallel is a great value prop: you only need to edit your script, and the SM service handles process creation. Unfortunately, as you point out, SMDP being limited to P3 and P4 training jobs seriously limits its use.
Below are important PT DDP concepts to understand in order to turn single-GPU code into multi-machine code:
Unlike Apache Spark, which takes care of workload partitioning on your behalf, PyTorch distributed training requires the user to assign specific pieces of work to specific GPUs. In the following section, we assume that we train on GPU.
In PyTorch DDP, each GPU runs a customized copy of your training code. A copy of the training code running on one GPU is generally called a rank, a data parallel replica, a process, or a worker, but other names may exist.
For PyTorch DDP to launch a training cluster on the MxN GPUs spread over your M machines, you must tell it the number of machines you have and the number of processes to launch per machine. This is done with the --nnodes and --nproc_per_node parameters of the torch.distributed.launch utility, respectively. You must run torch.distributed.launch once on each node of the training cluster. You can issue this command in parallel with multiple tools, for example with MPI or SageMaker Training as mentioned above. To establish the necessary handshakes and form a cluster, you must also specify in the torch.distributed.launch command --node_rank, which must take a unique machine ID between 0 and M-1 on each of the machines, and --master_addr and --master_port, which are optional if you run a single-machine cluster and must be the same across all machines.
In the init_process_group DDP initialization method, which runs within each data parallel replica script, you must specify the world size and replica ID with the world_size and rank parameters, respectively. Hence you must have a way to communicate a unique ID to each script, generally called the global rank. The global rank can help you personalize the work done by each GPU, for example saving a model from just one card, or running validation on only one card. In a cluster composed of 3 machines having 4 GPUs each, global ranks would range from 0 to 11.
Within a machine, in order to assign DDP data parallel replicas to available GPUs, the script running in each replica must be assigned a GPU ID that is unique within the machine it's running on. This is called the local rank and can be set as an argument by the PyTorch DDP torch.distributed.launch. In a cluster composed of 3 machines having 4 GPUs each, on each machine the DDP processes would have local ranks ranging from 0 to 3.
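The rank arithmetic for the 3-machine, 4-GPU example can be checked with a few lines of plain Python. This is a sketch assuming the usual convention of one process per GPU, with global ranks numbered node by node; global_rank is a hypothetical helper name:

```python
def global_rank(node_rank, local_rank, nproc_per_node):
    """Global rank of a worker, assuming ranks are numbered node by node."""
    return node_rank * nproc_per_node + local_rank


nnodes, nproc_per_node = 3, 4
world_size = nnodes * nproc_per_node

ranks = [
    global_rank(node, local, nproc_per_node)
    for node in range(nnodes)
    for local in range(nproc_per_node)
]
print(world_size)  # 12
print(ranks)       # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

The worker with local rank 2 on the machine with node rank 1 would then get global rank 6 and use the third GPU of its machine.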
I trained a torchvision Mask R-CNN model on GPU and saved it to disk using torch.save(model, model_name). On another machine, without a GPU, I try to load it again using torch.load(model_name). The model cannot be deserialized because torch does not know about device cuda:0.
How can I 'convert' such a model to be used on non-GPU environments?
I assume it is best practice to move a model to CPU before saving it?
torch.load() has an argument map_location where you can specify the device. So you can use
torch.load(..., map_location='cpu')
or specify any other device to directly load it there.
I am using an A100-SXM4-40GB GPU but training is terribly slow. I tried two models, a simple classification on CIFAR and a UNet on Cityscapes. I tried my code on other GPUs and it worked totally fine, but I do not know why training on this high-capacity GPU is super slow.
I would appreciate any help.
Here are some other properties of GPUs.
GPU 0: A100-SXM4-40GB
GPU 1: A100-SXM4-40GB
GPU 2: A100-SXM4-40GB
GPU 3: A100-SXM4-40GB
Nvidia driver version: 460.32.03
cuDNN version: Could not collect
Thank you for your answer. Before trying your answer, I decided to uninstall anaconda and reinstall it and this solved the problem.
Call .cuda() on the model during initialization.
As per your comments above, you have GPUs as well as CUDA installed, so there's no point in checking device availability with torch.cuda.is_available().
Additionally, you should wrap your model in nn.DataParallel to allow PyTorch to use every GPU you expose it to. You could also use DistributedDataParallel, but DataParallel is easier to grasp initially.
Example initialization:
model = UNet().cuda()
model = torch.nn.DataParallel(model)
Also, you can be sure you're exposing the code to all GPUs by executing the python script with the following flag:
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 train_unet.py
Last thing to note - nn.DataParallel encapsulates the model itself, so to save the state_dict you'll need to reach the module attribute inside DataParallel:
torch.save(model.module.state_dict(), 'unet.pth')
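If a checkpoint was instead saved from the wrapper itself (model.state_dict() rather than model.module.state_dict()), its keys carry a "module." prefix and will not match an unwrapped model. A minimal sketch of one way to clean such keys, using a hypothetical helper strip_module_prefix; it works on any mapping, so it is shown here on a plain dictionary with made-up layer names rather than on real tensors:

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with the DataParallel key prefix removed."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


# Stand-in for a state_dict saved from a DataParallel-wrapped model.
wrapped = OrderedDict([("module.conv1.weight", "w"), ("module.conv1.bias", "b")])
print(list(strip_module_prefix(wrapped)))  # ['conv1.weight', 'conv1.bias']
```

The cleaned dictionary can then be passed to load_state_dict on a model that is not wrapped in DataParallel.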
I am trying to run tensorflow 1.13.1 with Python 2.7 on SLF 6 without GPU support. When I start my model, tensorflow appears to be spawning multiple subprocesses and running my model in parallel, trying to load every core in the system. While in most cases this is what one would probably want, this is not my case. I would like to run my model on single core only.
I have tried setting these variables:
export OMP_NUM_THREADS=1
export KMP_BLOCKTIME=0
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0
in different combinations, but was not able to achieve single-core running.
Is there a way to run TensorFlow in "dumb" single-process mode?
There are two configurable options regarding parallelism in the tf.ConfigProto protocol buffer: inter_op_parallelism_threads and intra_op_parallelism_threads. To limit TensorFlow to a single thread for each, I think you can try:
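import tensorflow as tf
config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1,
                        allow_soft_placement=True)
# Pass the config to the session so that the thread limits take effect
session = tf.Session(config=config)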
There are other possible forms of parallelism; see mrry's answer in this thread.
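Independently of TensorFlow's own settings, the operating system can confine the whole process to one core, which also covers threads the framework starts on its own. This is a different, OS-level technique and not part of the answer above. The sketch assumes Linux and Python 3.3 or later (os.sched_setaffinity is not available on Python 2.7, where launching the script through the taskset command serves the same purpose); pin_to_single_core is a hypothetical helper name:

```python
import os


def pin_to_single_core():
    """Restrict the current process to one CPU core (Linux only).

    Returns the resulting set of allowed cores, or None on platforms
    that do not provide os.sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    # Pick a core the process is already allowed to use.
    core = min(os.sched_getaffinity(0))
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)


# Call this before importing TensorFlow so that threads created later
# inherit the restriction:
# pin_to_single_core()
# import tensorflow as tf
```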