I wrote the code for my CNN and was tuning the model over 10 iterations of 20 epochs each. When I run the code on my local 4 GB GPU, memory is exhausted on the 9th iteration. When I run the same code on Google Colab, its 12 GB is exhausted on the 1st iteration. How is this possible? It has three times the memory of my GPU, yet it runs out sooner. Please explain.
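If the tuning loop builds a new model on each of the 10 iterations and the old ones are never released, memory grows trial after trial, which would explain running out only after several iterations. A minimal sketch of clearing state between trials, assuming a Keras-style loop (the framework isn't shown in the question); the tiny model and random data below are placeholders:

import gc
import numpy as np
import tensorflow as tf

# Placeholder data and model so the loop is runnable; substitute the real CNN and dataset.
x_train = np.random.rand(256, 28, 28, 1).astype("float32")
y_train = np.random.randint(0, 10, size=256)

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

for trial in range(10):
    model = build_model()
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(x_train, y_train, epochs=20, verbose=0)
    # record the metrics you need, then drop every reference to the model
    del model
    tf.keras.backend.clear_session()  # releases the graphs/tensors Keras keeps alive between trials
    gc.collect()                      # lets Python actually reclaim the freed objects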
I am using the Stable Diffusion inpainting pipeline to generate inference results on an A100 (40 GB) GPU. For a 512×512 image it takes approximately 3 s per image and about 5 GB of memory on the GPU.
To get faster inference, I am trying to run 2 threads (2 inference scripts). However, as soon as I start them simultaneously, the inference time rises to ~6 s per thread, for an effective time of ~3 s per image.
I am unable to understand why this is so. I still have a lot of memory available on the GPU (about 35 GB) and a fairly large 32 GB of CPU RAM.
Can someone help me in this regard?
Regardless of the VRAM requirements, if the Stable Diffusion model is already using most of the SMs (streaming multiprocessors) on the GPU, there is no hardware left to run the inference of two images in parallel on the same GPU.
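When compute rather than memory is the bottleneck, batching inside a single pipeline call is usually a better way to use the spare VRAM than two competing threads. A minimal sketch, assuming the Hugging Face diffusers inpainting pipeline; the checkpoint name, prompt, image and mask below are placeholders for whatever the script already uses:

import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",   # assumed checkpoint; use the one you already load
    torch_dtype=torch.float16,
).to("cuda")

# Placeholder inputs; in practice these are the real 512x512 image and mask.
init_image = Image.new("RGB", (512, 512))
mask_image = Image.new("L", (512, 512), 255)
prompt = "a placeholder prompt"

# Lists of length 2 make one batched forward pass produce two images,
# so the SMs process both samples together instead of time-slicing two threads.
images = pipe(
    prompt=[prompt, prompt],
    image=[init_image, init_image],
    mask_image=[mask_image, mask_image],
).images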
I'm working on feature generation before I train a model in PyTorch. I wish to save my features as PyTorch tensors on disk for later use in training.
One of my features ("Feature A") is calculated on the CPU, while another feature ("Feature B") must be calculated from Feature A on a GPU (some linear algebra). I have an unusual limitation: on my university cluster, jobs which don't use GPUs have a CPU memory limit of 1 TB each, while jobs which do use GPUs have a CPU memory limit of 4 GB and a GPU memory limit of 48 GB. Feature A and Feature B are each approximately 10 GB.
Naturally, I want to first calculate Feature A using CPUs only, then save Feature A to disk. In another job (this one with GPU access and thus the 4 GB CPU memory limit), I want to load Feature A directly onto the GPU, compute Feature B, then save Feature B to disk.
With Feature A computed and saved to disk, I've tried:
feaB = torch.load(feaAfile, map_location=torch.device('cuda'))
And yet I max out my CPU memory. I've confirmed that CUDA is available.
In the PyTorch documentation I see that, when loading, tensors "are first deserialized on the CPU..."
I wonder if there is any way to avoid using CPU memory when I only want to load onto the GPU? If the tensor must first pass through the CPU, could I use some sort of 4 GB buffer? Thanks so much in advance.
EDIT: per discussion in the comments, I no longer need to do this. But the question itself, of loading a tensor to the GPU without using CPU memory, remains unanswered so I'm leaving this question up.
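For the record, one hedged workaround, assuming Feature A can be exported as a raw float32 array (e.g. with NumPy; the file name and shape below are placeholders): memory-map the file so the OS pages it in lazily, and copy it to the GPU in slices so only a small staging buffer ever occupies CPU RAM.

import numpy as np
import torch

shape = (2_500_000, 1024)            # hypothetical shape, roughly 10 GB of float32
feaA_mm = np.memmap("feaA_float32.bin", dtype=np.float32, mode="r", shape=shape)

feaA_gpu = torch.empty(shape, dtype=torch.float32, device="cuda")

chunk_rows = 100_000                 # ~0.4 GB per slice, well under the 4 GB CPU cap
for start in range(0, shape[0], chunk_rows):
    end = min(start + chunk_rows, shape[0])
    # only this slice is materialized in CPU RAM before being copied to the GPU
    block = torch.from_numpy(np.ascontiguousarray(feaA_mm[start:end]))
    feaA_gpu[start:end].copy_(block)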
Running "Get started with TensorFlow 2.0 for beginners" from
https://www.tensorflow.org/beta/tutorials/quickstart/beginner
in Colab
https://colab.research.google.com/github/tensorflow/docs/blob/r2.0rc/site/en/r2/tutorials/quickstart/beginner.ipynb
works fine and takes only a few seconds.
But I would like to run it locally. I extracted the Python code from the notebook. When started, the output does not look as intended (a problem with backspaces?), the ETA keeps growing, and the program does not finish in a reasonable time.
Can you please help me find what the problem is?
The tutorial on Colab uses the CPU by default, so make sure you are not using the GPU there. If you aren't, then look at your CPU and RAM: on Colab the CPU has roughly 13 GB of RAM. The problem is most likely the CPU power.
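Separately from the hardware point, the garbled output is typically the Keras progress bar, which redraws itself with carriage returns and can look broken (with a misleading ETA) when stdout is not a real terminal. A sketch along the lines of the extracted beginner-tutorial script, with verbose=2 so it prints one plain line per epoch instead of the interactive bar:

import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# verbose=2: one summary line per epoch, no progress-bar redraws
model.fit(x_train, y_train, epochs=5, verbose=2)

loss, acc = model.evaluate(x_test, y_test, verbose=0)
print(f"test loss: {loss:.4f}, test accuracy: {acc:.4f}")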
I need to create variables directly on the GPU because I am very limited in CPU RAM.
I found the method to do this here
https://discuss.pytorch.org/t/how-to-create-a-tensor-on-gpu-as-default/2128/3
which mentions using
torch.set_default_tensor_type('torch.cuda.FloatTensor')
However, when I tried
torch.set_default_tensor_type('torch.cuda.FloatTensor')
pytorchGPUDirectCreate = torch.FloatTensor(20000000, 128).uniform_(-1, 1).cuda()
It still seemed to take up mostly CPU RAM before being transferred to GPU RAM.
I am using Google Colab. To view RAM usage during the variable creation process, after running the cell, go to Runtime -> Manage Sessions
Both with and without torch.set_default_tensor_type('torch.cuda.FloatTensor'), the CPU RAM jumps up to 11.34 GB while GPU RAM stays low; then GPU RAM rises to 9.85 GB and CPU RAM drops back down.
It seems that torch.set_default_tensor_type('torch.cuda.FloatTensor') didn't make a difference.
For convenience here's a direct link to a notebook anyone can directly run
https://colab.research.google.com/drive/1LxPMHl8yFAATH0i0PBYRURo5tqj0_An7
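A likely explanation: torch.FloatTensor explicitly names the CPU tensor type, so the 20,000,000 × 128 buffer is allocated in CPU RAM first and only then copied by .cuda(), regardless of the default tensor type. A minimal sketch that allocates directly on the GPU instead:

import torch

# device='cuda' at creation time puts the ~10 GB buffer straight into GPU RAM,
# so CPU memory never has to hold a full copy.
pytorchGPUDirectCreate = torch.empty(
    20000000, 128, dtype=torch.float32, device='cuda'
).uniform_(-1, 1)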
Memory allocation with CuPy throws an out-of-memory error. Almost all of the 12 GB is consumed before training even starts, and then during training all of the memory is used up.
Everything works fine on the CPU.
I have tried reducing the batch size from 100 down to a single digit (4).
I have switched the small tasks to NumPy, since CuPy consumes GPU memory.
I have tried using more than one GPU but encounter the same problem.
please_refer_this
Please note, it works on the CPU.
The results are fine, but I need to train more, and for that using CuPy and a GPU is necessary.
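Since the training code itself isn't shown, here is a hedged sketch of two CuPy-level mitigations: backing CuPy's allocator with CUDA managed (unified) memory so allocations can spill past the 12 GB of device RAM, and releasing cached pool blocks between steps so stale allocations don't accumulate.

import cupy as cp

# (1) Route all CuPy allocations through a managed-memory pool; managed memory
# can be paged between host and device instead of failing outright at 12 GB.
managed_pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(managed_pool.malloc)

def free_cached_blocks():
    # (2) Call this between batches/epochs to return unused cached memory.
    managed_pool.free_all_blocks()
    cp.get_default_pinned_memory_pool().free_all_blocks()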