I am using PyTorch to train networks, but I find the DataLoader is very slow even with multiple workers. After some debugging I find that the disk I/O looks strange.
As you can see, I have many workers running under the python command, but their reading speed is 0. Only the mount.ntfs process shows any reading speed, and it is far from the full speed of the disk (>120 MB/s). Why?
Thanks for any help!
Even for single-instance training, PyTorch DistributedDataParallel (DDP) is generally recommended over PyTorch DataParallel (DP) because DP's strategy is less performant and it uses more memory on the default device. (Per this PyTorch forums thread)
Hugging Face recommends running distributed training via the python -m torch.distributed.launch launcher, because their Trainer API supports DDP but will fall back to DP if you don't launch the script that way. (Per this HF forums thread)
I recently ran into this problem: scaling a HF training job from p3.8xlarge to p3.16xlarge increased memory consumption on (I think) one of the GPUs to the point where I had to significantly reduce the batch size to avoid CUDA Out of Memory errors, basically losing all of the scaling advantage.
So the good news is that for p3.16xl and above I can just enable SageMaker Distributed Data Parallel and the PyTorch DLC will automatically launch via torch.distributed for me.
The bad news, for smaller workloads or for anyone wanting to test before scaling up, is that SMDistributed doesn't support all multi-GPU instance types; no p3.8xl or g series, for example. I did try manually setting the sagemaker_distributed_dataparallel_enabled environment variable, but no joy.
So how else can we launch HF Trainer scripts with PyTorch DDP on SageMaker?
Great question, thanks for asking! PyTorch DDP runs data-parallel workers in multiple processes that must be launched and managed by developers. DDP should be seen as a managed allreduce rather than a managed data-parallelism library, since it requires you to launch and manage the workers and even assign resources to workers. To launch the DDP processes in a SageMaker Training job you have several options:
If you do multi-GPU, single-machine, you can use torch.multiprocessing.spawn, as shown in this official PyTorch demo (which is broken, by the way); see the sketch after this list.
If you do multi-GPU, single-machine, you can also use the Ray Train library to launch those processes. I was able to use it in a notebook, but not in the DLC yet (it is a recent library that is a bit rough to learn and get working; see all my issues here). Ray Train should work on multi-node too.
If you do multi-GPU, on any number of machines, you can use torch.distributed.launch, wrapped in a launcher script in shell or Python. Example here: https://gitlab.aws.dev/cruchant/a2d2-segmentation/-/blob/main/3_2D-Seg-Audi-A2D2-Distributed-Training-DDP.ipynb
You can also launch those processes with the SageMaker MPI integration instead of torch.distributed. Unfortunately, we didn't create documentation for this, so nobody uses or pitches it. But it looks cool, because it allows you to run copies of your script directly on the EC2 machines without needing to invoke an intermediary PyTorch launcher. Example here
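To make option (1) concrete, here is a minimal sketch of a spawn-based single-machine launch. It is only an illustration under assumptions: MyModel, the dataset, and the master port are placeholders, not code from any of the linked examples.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(local_rank, world_size):
    # On a single machine, the local rank doubles as the global rank.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=local_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])  # MyModel is a placeholder
    # ... build a DataLoader with a DistributedSampler and run the training loop here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)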
So for now, my recommendation would be to go with route (3), which is the closest to what the PyTorch community does and so provides an easier development and debugging path.
Notes:
PyTorch DDP evolves fast. In PyTorch 1.10, torch.distributed.launch is superseded by torchrun, and a TorchX tool is being created to...simplify things!
Not having to manage that mess is a reason why SageMaker Distributed Data Parallel is a great value proposition: you only need to edit your script, and the SageMaker service handles process creation. Unfortunately, as you point out, SMDP being limited to P3 and P4 training jobs seriously limits its use.
Below are important PyTorch DDP concepts to understand when turning single-GPU code into multi-machine code.
Unlike Apache Spark, which takes care of workload partitioning on your behalf, PyTorch distributed training requires the user to assign specific pieces of work to specific GPUs. In the following section, we assume that we train on GPUs.
In PyTorch DDP, each GPU runs a customized copy of your training code. A copy of the training code running on one GPU is generally called a rank, a data-parallel replica, a process, or a worker, but other names may exist.
For PyTorch DDP to launch a training cluster on the MxN GPUs spread over your M machines, you must tell PyTorch DDP the number of machines you have and the number of processes to launch per machine. This is done respectively with the --nnodes and --nproc_per_node parameters of the torch.distributed.launch utility. You must run torch.distributed.launch once on each node of the training cluster. You can achieve this parallel command with multiple tools, for example with MPI or SageMaker Training as mentioned above. In order to establish the necessary handshakes and form a cluster, you must also specify in the torch.distributed.launch command --node_rank, which must take a unique machine ID between 0 and M-1 on each of the machines, and --master_addr and --master_port, which are optional if you run a single-machine cluster and must be the same across all machines.
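For illustration only, on a cluster of 2 machines with 4 GPUs each, the per-node invocations could look like the following; the master address, port, and train.py are placeholders, not values from this thread.
# On machine 0 (which also hosts the master):
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node=4 --master_addr=10.0.0.1 --master_port=29500 train.py
# On machine 1:
python -m torch.distributed.launch --nnodes=2 --node_rank=1 --nproc_per_node=4 --master_addr=10.0.0.1 --master_port=29500 train.py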
In the init_process_group DDP initialization method, which runs from within each data-parallel replica script, you must specify the world size and the replica ID, respectively with the world_size and rank parameters. Hence you must have a way to communicate to each script a unique ID, generally called the global rank. The global rank helps you personalize the work done by each GPU, for example saving a model from just one card, or running validation on just one card. In a cluster composed of 3 machines with 4 GPUs each, global ranks would range from 0 to 11. Within a machine, in order to assign the DDP data-parallel replicas to the available GPUs, the script running in each replica must be given a GPU ID that is unique within the machine it runs on. This is called the local rank and is passed as an argument by the torch.distributed.launch utility. In a cluster composed of 3 machines with 4 GPUs each, the DDP processes on each machine would have local ranks ranging from 0 to 3.
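As a sketch of these concepts (not taken from any of the linked examples), each replica script could look roughly like this, assuming it was started with torch.distributed.launch --use_env or torchrun so that RANK, WORLD_SIZE, and LOCAL_RANK are present in the environment; MyModel is a placeholder.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# The launcher exports one set of these variables per process.
local_rank = int(os.environ["LOCAL_RANK"])    # unique GPU ID within this machine
global_rank = int(os.environ["RANK"])         # unique replica ID across the cluster
world_size = int(os.environ["WORLD_SIZE"])    # total number of replicas (M x N)

dist.init_process_group(backend="nccl", rank=global_rank, world_size=world_size)
torch.cuda.set_device(local_rank)

model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])  # MyModel is a placeholder

# ... training loop ...

# Personalize work per replica, e.g. save the model from one card only.
if global_rank == 0:
    torch.save(model.module.state_dict(), "checkpoint.pth")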
I am running a code in Google Colab for training a neural network.
All my scripts have been working just fine, but starting this week, I have been receiving this error:
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
which seems to occur at random. Sometimes it occurs at the beginning of my script run, even before epoch 1; other times at epoch 160 or 56 or so. Nonetheless, it always seems to point to the same statement: loss.backward().
I'm running the code over GPU and have the paid subscription to Colab Pro.
Has anybody faced this issue? I read somewhere that this seems to be a problem of the GPU running out of memory; however, I can't say that for sure given the error messages I'm receiving.
Well, it took a while, but I managed to find the source of this problem myself. Some other posts mentioned this could be a GPU memory issue, so I tried to minimize memory usage as much as possible. Though this was good for my code, it didn't solve the problem.
Others talked about switching to CPU and running the script to get a better error message (which I did, and it took forever). Running my script on CPU gave an error about binary cross entropy not receiving inputs in the zero-to-one interval. This was clearly not the problem, since those inputs came from a sigmoid function.
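For anyone wanting to try the same CPU-debugging trick, here is a self-contained toy sketch of the idea; the tiny model and random data are made up and are not the code from the question.
import torch
import torch.nn as nn

debug_on_cpu = True   # flip to False once the error is understood
device = torch.device("cpu" if debug_on_cpu else "cuda")

# Toy stand-ins for the real model and batch.
model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid()).to(device)
criterion = nn.BCELoss()
x = torch.randn(8, 10, device=device)
y = torch.randint(0, 2, (8, 1), device=device).float()

loss = criterion(model(x), y)
loss.backward()   # on CPU, a failure here produces a readable Python stack trace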
Finally, I recalled the last thing I changed before my script started behaving like this, and I found out that it was the learning rate. When I ran my training with a learning rate of 0.001, everything was fine. I switched it to 0.02 (20 times higher) and then started receiving these execution errors at random. Switching back to the smaller learning rate solved the problem immediately. No more GPU errors, and now I'm happy.
So, if you have this issue, you might take a look at the learning rate; hopefully that will help you.
I am using an A100-SXM4-40GB GPU but training is terribly slow. I tried two models, a simple classifier on CIFAR and a U-Net on Cityscapes. I tried my code on other GPUs and it worked totally fine, but I do not know why training on this high-capacity GPU is super slow.
I would appreciate any help.
Here are some other properties of the GPUs.
GPU 0: A100-SXM4-40GB
GPU 1: A100-SXM4-40GB
GPU 2: A100-SXM4-40GB
GPU 3: A100-SXM4-40GB
Nvidia driver version: 460.32.03
cuDNN version: Could not collect
Thank you for your answer. Before trying it, I decided to uninstall and reinstall Anaconda, and this solved the problem.
Call .cuda() on the model during initialization.
As per your above comments, you have GPUs as well as CUDA installed, so there's no point in checking device availability with torch.cuda.is_available().
Additionally, you should wrap your model in nn.DataParallel to allow PyTorch to use every GPU you expose it to. You could also use DistributedDataParallel, but DataParallel is easier to grasp initially.
Example initialization:
model = UNet().cuda()
model = torch.nn.DataParallel(model)
Also, you can make sure you're exposing the script to all GPUs by setting the following environment variable when executing it:
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 train_unet.py
Last thing to note: nn.DataParallel encapsulates the model itself, so for saving the state_dict, you'll need to reach the module inside DataParallel:
torch.save(model.module.state_dict(), 'unet.pth')
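And, should you need it later, here is a small sketch of loading that checkpoint back into a plain (non-DataParallel) model; the map_location argument is just a common safety net, not something required by the answer above.
model = UNet()   # no DataParallel wrapper needed when loading for single-GPU use
model.load_state_dict(torch.load('unet.pth', map_location='cpu'))
model = model.cuda()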
I train a network on GPU with PyTorch. However, after at most 3 epochs, the code stops with the message:
Killed
No other error message is given.
I monitored the memory and GPU usage, and there was still space during the run. I reviewed /var/log/dmesg to find a detailed message about this, but no message containing "kill" was logged. What might be the problem?
CUDA version: 9.0
PyTorch version: 1.1.0
If you had root access, you could check whether this is a memory issue or not with the dmesg command.
In my case, the process was killed by the kernel due to running out of memory.
I found the cause to be saving tensors that require grad to a list; each of those stores an entire computation graph, which consumes significant memory.
I fixed the issue by saving the .detach()-ed tensor to the list instead of the tensor returned by the loss function.
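A self-contained toy sketch of that fix; the tiny linear model and random data are stand-ins for the real training loop.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

losses = []
for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # losses.append(loss)           # keeps every step's computation graph alive
    losses.append(loss.detach())    # stores only the value; loss.item() also works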
You can type "dmesg" on your terminal and scroll down to the bottom. It will show you the message of why it is killed.
Since you mentioned PyTorch, the chances are that your process is killed due to "Out of Memory". To resolve this, reduce your batch size till you no longer see the error.
Hope this helps! :)
In order to give an idea to people who will encounter this:
Apparently, Slurm was installed on the machine, so I needed to submit the tasks through Slurm.
I have an application written in Fortran which makes use of parallel HDF5 for input / output.
A matching post-processing code is used to read its output, in the form of a *.h5 file, and process it.
When I try to use valgrind to check for memory leaks, however, it stalls when reading large datasets.
More exactly, the stalling occurs at a call to H5Dread_f for large datasets, for example 1069120 doubles (where doubles are defined as H5kind_to_type(REAL64,H5_REAL_KIND)), whereas for smaller ones it is okay.
I tried recompiling the HDF5 library using --enable-using-memchecker, as described here, but it didn't help.
Does anybody have more experience with this?
I found the cause / solution: due to an error of mine, the chunk size I used in these HDF5 routines was often only 1 byte, which is of course much too small.
Fixing this also makes valgrind much, much faster and usable.
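The original code is Fortran, but the same pitfall is easy to illustrate with h5py in Python; the file name, dataset names, and chunk sizes below are made-up values rather than the ones from the post.
import h5py
import numpy as np

data = np.random.rand(1069120)   # roughly the dataset size mentioned above

with h5py.File("example.h5", "w") as f:
    # One-element chunks force one I/O operation per value and are painfully slow to read back.
    f.create_dataset("bad_chunking", data=data, chunks=(1,))
    # Chunks of a few thousand elements keep each chunk in the tens-of-kilobytes range.
    f.create_dataset("good_chunking", data=data, chunks=(8192,))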