Error: OOM when allocating tensor with shape - python-3.x

I am facing an issue with my Inception model during performance testing with Apache JMeter.
Error: OOM when allocating tensor with shape[800,1280,3] and type
float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator
GPU_0_bfc [[Node: Cast = CastDstT=DT_FLOAT, SrcT=DT_UINT8,
_device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Hint: If you want to see a list of allocated tensors when OOM happens,
add report_tensor_allocations_upon_oom to RunOptions for current
allocation info.

OOM stands for Out Of Memory. It means your GPU has run out of space, presumably because you've allocated other tensors that are too large. You can fix this by making your model smaller or by reducing your batch size. By the looks of it, you're feeding in a large image (800x1280), so you may want to consider downsampling it.
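For instance, a minimal TF1-style sketch (the tensor and session names below are hypothetical, not taken from the original serving code) that resizes the input before it reaches the Cast node and enables the allocation report mentioned in the error hint:

import tensorflow as tf

# Hypothetical names: `image_uint8` is the decoded 800x1280x3 input tensor;
# `sess`, `logits`, `input_ph` and `batch` come from your existing serving code.

# 1) Downsample the input in the graph before the Cast/preprocessing nodes,
#    e.g. to Inception's usual 299x299 input size.
image_small = tf.image.resize_images(image_uint8, size=[299, 299])

# 2) If an OOM still occurs, ask TensorFlow to report tensor allocations,
#    as the error hint suggests.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
preds = sess.run(logits, feed_dict={input_ph: batch}, options=run_options)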

If you have multiple GPUs at hand, select a GPU that is not as busy as this one (a likely reason is that other processes are also running on this GPU). Go to the terminal and type
export CUDA_VISIBLE_DEVICES=1
where 1 is the index of another available GPU, then re-run the same code.
You can check the available GPUs using
nvidia-smi
which will show you which GPUs are available and how much memory is free on each of them.
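The same selection can also be made from Python, as long as it happens before the framework initialises CUDA; a small sketch:

import os

# Must run before TensorFlow/PyTorch touches CUDA, ideally at the very top of
# the script. "1" selects the second physical GPU, which the framework then
# sees as device 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf  # imported only after the environment variable is set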

Related

In PyTorch, can I load a tensor from file directly to the GPU, without using CPU memory?

I'm working on feature generation before I train a model in PyTorch. I wish to save my features as PyTorch tensors on disk for later use in training.
One of my features ("Feature A") is calculated on a CPU while another feature ("Feature B") must be calculated from Feature A on a GPU (some linear algebra stuff). I have an unusual limitation: on my university cluster, jobs which don't use GPUs have CPU memory limits of 1TB each, while jobs which do use GPUs have CPU memory limits of 4GB and GPU memory limits of 48GB. Feature A and Feature B are each approximately 10GB.
Naturally, I want to first calculate Feature A using CPUs only then save Feature A to disk. In another job (this one with GPU access and thus the 4GB CPU memory limitation), I want to load Feature A directly to GPU, compute Feature B, then save Feature B to disk.
With Feature A computed and saved to disk, I've tried:
feaB = torch.load(feaAfile, map_location=torch.device('cuda'))
And yet I max-out my CPU memory. I've confirmed cuda is available.
In the PyTorch documentation I see that in loading tensors they "are first deserialized on the CPU..."
I wonder if there is any way to avoid a CPU memory implication when I want to load only onto the GPU? If the tensor must first be copied to the CPU, could I use some sort of 4GB buffer? Thanks so much in advance.
EDIT: per discussion in the comments, I no longer need to do this. But the question itself, of loading a tensor to the GPU without using CPU memory, remains unanswered so I'm leaving this question up.
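One possible direction, not verified against the cluster limits described above: recent PyTorch versions (2.1+) support memory-mapped loading via torch.load(..., mmap=True), which avoids materialising the whole file in CPU RAM at once. A hedged sketch, assuming Feature A was saved with torch.save using the default zipfile serialization:

import torch

# mmap=True maps the file instead of reading it fully into CPU memory; the
# subsequent .to("cuda") copy streams file-backed pages through RAM, so peak
# resident CPU memory should stay well below the tensor size. Whether this
# actually stays under a 4 GB cgroup limit is worth verifying on the cluster.
feaA = torch.load("feaA.pt", map_location="cpu", mmap=True)
feaA = feaA.to("cuda", non_blocking=True)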

Is it possible to store some tensors on CPU and other on GPU for training neural network in PyTorch?

I designed a neural network in PyTorch which demands a lot of GPU memory, or else has to run with a very small batch size.
The GPU runtime error is caused by three lines of code, which store two new tensors and perform some operations on them.
I don't want to run my code with a small batch size. So I want to execute those three lines of code (and hence store those new tensors) on the CPU, and run all the remaining code on the GPU as usual.
Is it possible to do?
It is possible.
You can use .to(device=torch.device('cpu')) to move the relevant tensors from GPU to CPU, and back to GPU afterwards:
orig_device = a.device # store the device from which the tensor originated
# move tensors a and b to CPU
a = a.to(device=torch.device('cpu'))
b = b.to(device=torch.device('cpu'))
# do some operation on a and b - it will be executed on CPU
res = torch.bmm(a, b)
# put the result back to GPU
res = res.to(device=orig_device)
A few notes:
Moving tensors between devices, or between GPU and CPU, is not an unusual event. The term used to describe it is "model parallel" - you can google it for more details and examples.
Note that the .to() operation is not an "in place" operation.
Moving tensors back and forth between GPU and CPU takes time, so it might not be worthwhile to use "model parallelism" of this type here. If you are struggling with GPU space, you might consider gradient accumulation instead.
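For reference, a minimal sketch of gradient accumulation (model, loader, criterion and optimizer are placeholder names): the effective batch size is accum_steps times the per-step batch size, while the GPU only ever has to hold one small batch of activations.

accum_steps = 4  # effective batch size = accum_steps * per-step batch size
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    x, y = x.to(device), y.to(device)
    loss = criterion(model(x), y) / accum_steps  # scale so gradients average out
    loss.backward()                              # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()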

Does calling forward() on a model in pytorch require extra gpu memory after already having loaded the model and data in gpu memory?

I can load the model and a data sample in gpu memory, but when I call forward on the model with the sample, it gives a CUDA out of memory error.
I'm sure the model and data have been loaded, as my code is structured as follows (pseudocode):
model = Model()
sample = load_sample()
sleep(5) # to check memory usage with nvidia-smi
print('before forward')
model(sample)
print('after forward')
"before forward" gets printed, but "after forward" does not.
I assumed all the necessary memory for a forward pass gets allocated during construction of the model, but I don't know how else this error can happen. I also cannot find it on Google.
Python: 3.6.9
PyTorch: 1.2.0
It is not possible to determine the amount of space required to store the activations before runtime, and hence GPU memory usage increases during the forward pass. PyTorch maintains a dynamic computation graph, so the order of computations is not known before runtime. When you declare/initialize the model, only __init__ is called and the model parameters are initialized. To figure out the graph, one would need to look at the forward call, and maybe also the loss function (if it is not within the forward call).
Even if we could inspect the forward call before running the model, the batch size is still unknown, so memory can't be pre-allocated for activations.
Even if the batch size is known, there could be other unknowns, such as sequence length (for RNNs) or episode length in RL, that make it hard to pre-allocate memory for activations. And even if we accounted for all of this at declaration time, PyTorch naturally allows for-loops, which makes it almost impossible to pre-allocate space for activations; hence GPU memory can increase during runtime depending on the use case.
As Umang Gupta pointed out in the comments, GPU memory will increase during a forward() call on a Pytorch model, as (possibly amongst others) the batch size is not known before runtime. Therefore the required memory cannot be reserved beforehand, and the GPU memory can increase after having loaded the model and data already.
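If you want to see this effect directly, a small sketch using PyTorch's memory counters (and torch.no_grad(), which skips storing activations for backward if you only need inference):

import torch

print(torch.cuda.memory_allocated() / 2**20, "MiB after loading model and sample")

with torch.no_grad():   # drop this if you intend to call backward() afterwards
    out = model(sample)

print(torch.cuda.memory_allocated() / 2**20, "MiB after the forward pass")
print(torch.cuda.max_memory_allocated() / 2**20, "MiB peak so far")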

How to free the consumption of memory from Memory Pool for Training on GPU?

Memory allocation using CuPy throws an out-of-memory error. Almost all of the 12 GB of GPU memory is consumed before training even starts; then, during training, the rest is consumed as well.
Everything is working fine on CPU.
I have tried reducing the batch size from 100 to single digits (4).
I have moved the small tasks to NumPy, since CuPy consumes GPU memory.
I have tried using more than one GPU but encounter the same problem.
please_refer_this
Please note, it works on CPU.
The results are fine, but I need to train more, and for that using CuPy and a GPU is necessary.
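No answer is recorded here, but CuPy does expose its memory pools directly; a hedged sketch of releasing cached blocks between training steps (this only frees memory CuPy has cached, not memory still referenced by live arrays, so drop references to intermediates first):

import cupy as cp

mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()

# ... after a training step, once intermediate arrays are out of scope ...
mempool.free_all_blocks()         # return cached GPU blocks to the driver
pinned_mempool.free_all_blocks()  # return cached pinned host memory

print(mempool.used_bytes(), mempool.total_bytes())  # inspect current pool usage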

Memory Estimation for Convolution Neural Network in Tensorflow

Hello Everyone,
I am working on an image classification problem using TensorFlow and a convolutional neural network.
My model has the following layers:
Input image of size 2456x2058
3 convolution layers {Con1-shape(10,10,1,32); Con2-shape(5,5,32,64); Con3-shape(5,5,64,64)}
3 max pool 2x2 layers
1 fully connected layer.
I have tried using the NVIDIA-SMI tool, but it only shows me the GPU memory consumption while the model runs.
I would like to know if there is any method or way to estimate the memory before running the model on the GPU, so that I can design models with the available memory in mind.
I have tried using this method for estimation, but my calculated memory and the observed memory utilisation are nowhere near each other.
Thank you all for your time.
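For what it's worth, a rough back-of-the-envelope estimator for the layers listed above. Assumptions (not from the question): float32 activations, 'SAME' padding, stride-1 convolutions, 2x2 pooling, and a batch size you supply; it ignores parameters, the fully connected layer, gradients, optimizer state, cuDNN workspaces and TensorFlow's own overhead, which is usually why such estimates and nvidia-smi disagree.

def conv_net_activation_bytes(batch=1, h=2456, w=2058, dtype_bytes=4):
    """Very rough activation-memory estimate for the 3-conv/3-pool net above."""
    total = batch * h * w * 1 * dtype_bytes                    # input image, 1 channel
    for out_channels in (32, 64, 64):                          # Con1, Con2, Con3
        total += batch * h * w * out_channels * dtype_bytes    # conv output (SAME, stride 1)
        h, w = h // 2, w // 2                                  # 2x2 max pool
        total += batch * h * w * out_channels * dtype_bytes    # pooled output
    return total

print(conv_net_activation_bytes(batch=8) / 2**30, "GiB of activations (forward only)")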
As far as I understand, when you open a session with tensorflow-gpu, it allocates all the memory on the GPUs that are available. So when you look at the nvidia-smi output, you will always see the same amount of used memory, even if the model actually uses only part of it. There are options when opening a session that force TensorFlow to allocate only a part of the available memory (see How to prevent tensorflow from allocating the totality of a GPU memory? for instance).
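One such option is allow_growth, which makes TensorFlow grab GPU memory on demand instead of reserving it all up front; a short sketch with the TF1 Session API:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # allocate GPU memory as needed, not all at once
sess = tf.Session(config=config)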
You can control the GPU memory allocation in TensorFlow. Once you have calculated the memory requirements of your deep learning model, you can use tf.GPUOptions.
For example, to cap TensorFlow at roughly 40% of an 8 GB GPU (about 3.2 GB):
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)
Once done, pass it to tf.Session using the config parameter, as shown above.
The per_process_gpu_memory_fraction is used to bound the available amount of GPU memory.
Here's the link to the documentation:
https://www.tensorflow.org/tutorials/using_gpu
NVIDIA-SMI ... shows me the GPU memory consumption as the model run
TF preallocates all available memory when you use it, so NVIDIA-SMI would show nearly 100% memory usage ...
but my calculated memory and observed memory utilisation are no where near to each other.
.. so this is unsurprising.
