PyTorch - Distributed Data Parallel Confusion

I was just looking at the DDP Tutorial:
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
According to this:
It’s common to use torch.save and torch.load to checkpoint modules
during training and recover from checkpoints. See SAVING AND LOADING
MODELS for more details. When using DDP, one optimization is to save
the model in only one process and then load it to all processes,
reducing write overhead. This is correct because all processes start
from the same parameters and gradients are synchronized in backward
passes, and hence optimizers should keep setting parameters to the
same values. If you use this optimization, make sure all processes do
not start loading before the saving is finished. Besides, when loading
the module, you need to provide an appropriate map_location argument
to prevent a process to step into others’ devices. If map_location is
missing, torch.load will first load the module to CPU and then copy
each parameter to where it was saved, which would result in all
processes on the same machine using the same set of devices. For more
advanced failure recovery and elasticity support, please refer to
TorchElastic.
I don't understand what this means. Shouldn't only one process/first GPU be saving the model? Is saving and loading how weights are shared across the processes/GPUs?

When you're using DistributedDataParallel you have the same model across multiple devices, which are being synchronised to have the exact same parameters.
When using DDP, one optimization is to save the model in only one process and then load it to all processes, reducing write overhead.
Since they are identical, it is unnecessary to save the models from all processes, as that would just write the same parameters multiple times. For example, with 4 processes/GPUs you would write the same file 4 times instead of once. That can be avoided by only saving it from the main process.
That is an optimisation for the saving of the model. If you load the model right after you saved it, you need to be more careful.
If you use this optimization, make sure all processes do not start loading before the saving is finished.
If you save it from only one process, that process needs time to write the file. In the meantime all the other processes continue, and they might try to load the file before it has been fully written to disk, which can lead to all sorts of unexpected behaviour or failures: the file may not exist yet, you may be reading an incomplete file, or you may load an older version of the model (if you overwrite the same file).
Besides, when loading the module, you need to provide an appropriate map_location argument to prevent a process to step into others’ devices. If map_location is missing, torch.load will first load the module to CPU and then copy each parameter to where it was saved, which would result in all processes on the same machine using the same set of devices.
When saving the parameters (or any tensor for that matter) PyTorch includes the device where it was stored. Let's say you save it from the process that used GPU 0 (device = "cuda:0"); that information is saved, and when you load it, the parameters are automatically put onto that device. But if you load it in the process that uses GPU 1 (device = "cuda:1"), you will incorrectly load them onto "cuda:0". Now, instead of using multiple GPUs, you have the same model multiple times on a single GPU. Most likely you will run out of memory, but even if you don't, you won't be utilising the other GPUs anymore.
To avoid that problem, you should set the appropriate device for map_location of torch.load.
torch.load(PATH, map_location="cuda:1")
# Or load it on the CPU and later use .to(device) on the model
torch.load(PATH, map_location="cpu")
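Putting the two pieces together, here is a minimal sketch (freely adapted, not taken verbatim from the tutorial) of the save-on-one-rank / load-everywhere pattern, assuming a single machine where the global rank equals the local GPU index; names like ddp_model and CHECKPOINT_PATH are illustrative:

import torch
import torch.distributed as dist

def checkpoint_and_reload(ddp_model, rank):
    CHECKPOINT_PATH = "model.checkpoint"  # illustrative path
    if rank == 0:
        # All replicas hold identical parameters, so one write is enough.
        torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)

    # Block every process until rank 0 has finished writing the file.
    dist.barrier()

    # Remap tensors saved from "cuda:0" onto this process's own GPU.
    map_location = {"cuda:0": f"cuda:{rank}"}
    ddp_model.load_state_dict(torch.load(CHECKPOINT_PATH, map_location=map_location))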

Related

Use PyTorch DistributedDataParallel with Hugging Face on Amazon SageMaker

Even for single-instance training, PyTorch DistributedDataParallel (DDP) is generally recommended over PyTorch DataParallel (DP) because DP's strategy is less performant and it uses more memory on the default device. (Per this PyTorch forums thread)
Hugging Face recommend running distributed training via the python -m torch.distributed.launch launcher, because their Trainer API supports DDP but will fall back to DP if you don't use the launcher. (Per this HF forums thread)
I recently ran into this problem: scaling an HF training job from p3.8xlarge to p3.16xlarge increased memory consumption on (I think) one of the GPUs to the point where I had to significantly reduce batch size to avoid CUDA Out of Memory errors - basically losing all scaling advantage.
So the good news is that for p3.16xl+ I can just enable SageMaker Distributed Data Parallel and the PyTorch DLC will automatically launch via torch.distributed for me.
The bad news, for use cases with smaller workloads or for those wanting to test before they scale up, is that SMDistributed doesn't support all multi-GPU instance types - no p3.8xl or g series, for example. I did try manually setting the sagemaker_distributed_dataparallel_enabled environment variable, but no joy.
So how else can we launch HF Trainer scripts with PyTorch DDP on SageMaker?
Great question, thanks for asking! PyTorch DDP runs data parallel workers in multiple processes that must be launched and managed by developers. DDP should be seen as a managed allreduce, more than a managed data-parallelism library, since it requires you to launch and manage the workers and even assign resources to workers. In order to launch the DDP processes in a SageMaker Training job you have several options:
If you do multi-GPU, single-machine, you can use torch.multiprocessing.spawn, as shown in this official PyTorch demo (which is broken, by the way)
If you do multi-GPU, single-machine, you can also use the Ray Train library to launch those processes. I was able to use it in a Notebook, but not in the DLC yet (it's a recent library that is a bit rough to learn and get working, see all my issues here). Ray Train should work on multi-node too.
If you do multi-GPU, any-machine, you can use torch.distributed.launch, wrapped in a launcher script in shell or Python. Example here https://gitlab.aws.dev/cruchant/a2d2-segmentation/-/blob/main/3_2D-Seg-Audi-A2D2-Distributed-Training-DDP.ipynb
You can also launch those processes with the SageMaker MPI integration instead of torch.distributed. Unfortunately, we didn't create documentation for this, so no one uses or pitches it. But it looks cool, because it allows you to run copies of your script directly on the EC2 machines without needing to invoke an intermediary PyTorch launcher. Example here
So for now, my recommendation would be to go with route (3), which is the closest to what the PyTorch community does and so provides an easier development and debugging path; a sketch of such a launcher follows.
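For concreteness, here is a rough sketch of what such a Python launcher wrapper could look like on a SageMaker Training host; the script name train.py, the master port and the exact environment handling are assumptions, not a tested recipe:

import json
import os
import subprocess
import sys

def main():
    # SageMaker describes the training cluster in this file on every host.
    with open("/opt/ml/input/config/resourceconfig.json") as f:
        cfg = json.load(f)
    hosts, current_host = cfg["hosts"], cfg["current_host"]

    cmd = [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nnodes={len(hosts)}",
        f"--node_rank={hosts.index(current_host)}",
        f"--nproc_per_node={os.environ.get('SM_NUM_GPUS', '1')}",
        f"--master_addr={hosts[0]}",
        "--master_port=29500",        # assumed free port, identical on all hosts
        "train.py",                   # your HF Trainer script (hypothetical name)
    ] + sys.argv[1:]                  # forward any hyperparameters
    subprocess.check_call(cmd)

if __name__ == "__main__":
    main()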
Notes:
PyTorch DDP evolves fast. In PT 1.10, torch.distributed.launch is superseded by torchrun, and a TorchX tool is being created to... simplify things!
Not having to manage that mess is a reason why SageMaker Distributed Data Parallel is a great value prop: you only need to edit your script, and the SM service handles process creation. Unfortunately, as you point out, SMDP being limited to P3 and P4 training jobs seriously limits its use.
Below are important PT DDP concepts to understand in order to turn single-GPU code into multi-machine code.
Unlike Apache Spark, which takes care of workload partitioning on your behalf, PyTorch distributed training requires the user to assign specific pieces of work to specific GPUs. In the following section, we assume that we train on GPUs.
In PyTorch DDP, each GPU runs a customized copy of your training code. A copy of the training code running on one GPU is generally called a rank, a data parallel replica, a process, or a worker, but other names may exist.
For PyTorch DDP to launch a training cluster on the MxN GPUs spread over your M machines, you must tell PyTorch DDP the number of machines you have and the number of processes to launch per machine. This is done respectively with the --nnodes and --nproc_per_node parameters of the torch.distributed.launch utility. You must run torch.distributed.launch once on each node of the training cluster. You can achieve this parallel command with multiple tools, for example with MPI or SageMaker Training as mentioned above. In order to establish the necessary handshakes and form a cluster, you must also pass --node_rank to the torch.distributed.launch command, which must take a unique machine ID between 0 and M-1 on each of the machines, as well as --master_addr and --master_port (optional if you run a single-machine cluster), which must be the same across all machines.
In the init_process_group DDP initialization method running from within each data parallel replica script, you must specify the world size and the replica ID, respectively with the world_size and rank parameters. Hence you must have a way to communicate a unique ID, generally called the global rank, to each script. The global rank can help you personalize the work done by each GPU, for example saving a model from just one card, or running validation on only one card. In a cluster composed of 3 machines with 4 GPUs each, global ranks would range from 0 to 11. Within a machine, in order to assign DDP data parallel replicas to available GPUs, the script running in each replica must be assigned a GPU ID that is unique within the machine it's running on. This is called the local rank and can be passed in as an argument by the PyTorch DDP torch.distributed.launch utility. In a cluster composed of 3 machines with 4 GPUs each, on each machine the DDP processes would have local ranks ranging from 0 to 3.
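To make those concepts concrete, here is a minimal, hypothetical skeleton of what each replica could run when started by torch.distributed.launch; the toy model, the NCCL backend, and the argument handling are assumptions rather than a drop-in script:

import argparse
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)  # passed by torch.distributed.launch
    args = parser.parse_args()

    # The launcher exports the global rank and world size as environment variables.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(args.local_rank)                     # pin this replica to its own GPU

    model = torch.nn.Linear(10, 10).to(args.local_rank)        # toy model
    ddp_model = DDP(model, device_ids=[args.local_rank])

    if rank == 0:
        pass  # e.g. only the global rank 0 saves checkpoints or runs validation

if __name__ == "__main__":
    main()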

If I Trace a PyTorch Network on Cuda, can I use it on CPU?

I traced my neural network using torch.jit.trace on a CUDA-compatible GPU server. When I reloaded that trace on the same server, I could load and use it fine. But now that I have downloaded it onto my laptop (for quick testing), I get the following when I try to load the trace:
RuntimeError: Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend. 'aten::empty_strided' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].
Can I not switch between GPU and CPU on a trace? Or is there something else going on?
I had this exact same issue. In my model I had one line of code that was causing this:
if torch.cuda.is_available():
    weight = weight.cuda()
If you have a look at the official documentation for trace (https://pytorch.org/docs/stable/generated/torch.jit.trace.html) you will see that
the returned ScriptModule will always run the same traced graph on any input. This has some important implications when your module is expected to run different sets of operations, depending on the input and/or the module state
So, if the model was traced on a machine with a GPU, this operation will be recorded and you won't even be able to load your model on a CPU-only machine. To solve this, delete everything that makes your model CUDA-dependent. In my case it was as easy as deleting the code block above.
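For illustration, here is a small self-contained sketch of the device-agnostic pattern (ToyModel and the file name are stand-ins for your own network and path): keep device moves out of forward(), trace, and load with map_location on the CPU-only machine.

import torch
import torch.nn as nn

class ToyModel(nn.Module):                       # stand-in for your network
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 4)

    def forward(self, x):
        # No torch.cuda.is_available() branches and no .cuda() calls in here,
        # so the traced graph contains no device-specific operations.
        return self.linear(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ToyModel().to(device).eval()             # move the module outside forward()

example = torch.randn(1, 8, device=device)
traced = torch.jit.trace(model, example)
traced.save("model_traced.pt")

# Later, on a CPU-only laptop:
loaded = torch.jit.load("model_traced.pt", map_location="cpu")
print(loaded(torch.randn(1, 8)))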

Vulkan: Concurrent host-writes and device reads to separate parts of same VkMemory

To transfer my static data into the GPU, I'm thinking of having a single staging VkMemory object (ballpark 64MB), and using it as a rotating queue. However, I have multiple threads producing content (eg: rendering glyphs, loading files, procedural) and I'd like it if they could upload their data entirely by themselves (i.e. write plus submit Vulkan transfer commands).
I'm intending to keep the entire staging VkMemory permanently mapped (if this is dumb please say so) at least during loading (but perhaps longer if I want to stream data).
To achieve the above, once a thread's data is fully written/flushed to staging I'd like it to be able to immediately submit GPU transfer commands.
However, that means the GPU will be reading from one part of the VkMemory while other threads may be writing/flushing to it.
AFAIK I will also need to use image memory barriers for the transition from VK_IMAGE_LAYOUT_PREINITIALIZED to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL.
I couldn't find anything in the spec explicitly saying this was legal or illegal, only that care should be taken to ensure synchronization. However, I didn't find enough detail to be sure one way or the other.
NOTE: The staging queue will need to ensure transfers have been completed before overwriting anything - I intend to keep a complementary queue of VkFences for this.
Questions:
Is this OK?
Do I need to align each separate object to a page boundary? Or something else.
Am I correct in assuming that the image memory barrier (above) won't require the device to write to staging memory?
Yes, the spec says that the regions being read from and written to must be synchronized.
If the memory is not coherent, then you must align the blocks being read from or written to to nonCoherentAtomSize.
Source: the Vulkan spec, in the note after the declaration of vkMapMemory:
vkMapMemory does not check whether the device memory is currently in
use before returning the host-accessible pointer. The application must
guarantee that any previously submitted command that writes to this
range has completed before the host reads from or writes to that
range, and that any previously submitted command that reads from that
range has completed before the host writes to that region (see here
for details on fulfilling such a guarantee). If the device memory was
allocated without the VK_MEMORY_PROPERTY_HOST_COHERENT_BIT set, these
guarantees must be made for an extended range: the application must
round down the start of the range to the nearest multiple of
VkPhysicalDeviceLimits::nonCoherentAtomSize, and round the end of the
range up to the nearest multiple of
VkPhysicalDeviceLimits::nonCoherentAtomSize.
A layout transition may write to the memory; however, barriers do their own syncing with regard to previous and subsequent memory accesses.
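As a rough illustration of the rounding that note requires for non-coherent memory, here is a tiny sketch (Python is used only for the arithmetic; the 64-byte atom size and the offsets are made-up values):

non_coherent_atom_size = 64          # VkPhysicalDeviceLimits::nonCoherentAtomSize (hypothetical)
write_offset, write_size = 100, 200  # the byte range one thread actually wrote

# Round the start down and the end up to the nearest multiple of the atom size.
flush_offset = (write_offset // non_coherent_atom_size) * non_coherent_atom_size
flush_end = ((write_offset + write_size + non_coherent_atom_size - 1)
             // non_coherent_atom_size) * non_coherent_atom_size

print(flush_offset, flush_end - flush_offset)   # 64 256 -> the range to flush/invalidate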

Why does all the information I find online about using the SDRAM of my DE1-SOC point me to NIOS-II?

I'm doing a simple project: take 100 numbers from an external memory (one by one), do some simple arithmetic on each number (like adding 1) and return it to another memory.
I successfully did that project by "representing" a memory in Verilog code; however, I now want to synthesize my design using the SDRAM of the board. How I load data into the SDRAM, or what I do with the resulting data output back to the SDRAM, is irrelevant for my homework.
But I just can't understand what to do; all the information on the internet points me to using NIOS-II. Considering that I have to load data into the SDRAM before it can serve me (and perhaps for other reasons), is NIOS-II the most recommended way to do this? Can it be done without it, and would that be more practical?
This might not be the place to have your homework done. Additionally, your question is very unclear. Let's try anyway:
I successfully did that project "representing" a memory in verilog code
I assume that you mean that you downloaded a model corresponding to the memory you have on your board.
taking 100 numbers from an external memory
I wonder how you do that. Did you load some initialization file, or did you write the numbers first? In the first case: this will not be synthesized and you might read random data; you should refer to the datasheet of your memory for this. If you expect specific values, you will need to write them to memory during some initialization procedure.
Of course you will need the correct constraints for your device. So I'd suggest that you take the NIOS-II example, get it up and running, and get rid of the NIOS-II in a next step. That way you will at least be sure that the interfacing between the controller and the SDRAM is correct. Then read the datasheet of the controller. You probably have a read strobe, a write strobe, data-in and data-out ports, some configuration, perhaps a burst length. If you need help with that, you'll need to come up with a more specific question.

ImageMagick's display GPU "memory leak"?

I'm testing CUDA app and I have run into strange memory issue:
My program performs some image operations and displays it using ImageMagick's display program.
The problem is that every time I run IM's display I get more GPU memory usage, so there is less memory left for GPU computation.
I'm using IM's display because I couldn't find anything else that displays an image from pipe input. Any suggestions?
Anyway, why does IM's display take so much GPU memory, and why is it not freed?
Based on your question, you're attempting to display a series of files in sequence using a shell not unlike Bash after performing a set of GPU-intensive operations. You're curious why more GPU memory is being consumed with every subsequent invocation of ImageMagick display, which appears to be closing out successfully after the conclusion of each operation.
We may further theorize that you're using ImageMagick's OpenCL support for at least some of your processing. While we don't have enough information to determine what your GPU's texture buffers look like at the completion of each rendering via display, I speculate your GPU isn't freeing textures expediently, causing memory to slowly creep up.
Instead of continuing to build conjecture around this hypothesis, I will instead recommend a tool to debug your issue: gDEBugger. This should allow you to interrogate your video card to determine exactly why things are slowing down.
Best of luck with your application.
I know it's old, but we figured out that using pipes (popen()) makes a copy of the program in memory, which also copies the end-of-program directives, or whatever they are called... So when I close a program opened with popen, I also finish all the CUDA-related contexts that are usually freed in the "background" when the program ends. So cleaning up CUDA memory after I close the popen'd application won't work, and I think this was my memory leak and general major program error.
I hope someone will find it useful.
