Hello Everyone,
I am working on an image classification problem using TensorFlow and a Convolutional Neural Network.
My model has the following layers:
Input image of size 2456x2058
3 convolution layers {Conv1-shape(10,10,1,32); Conv2-shape(5,5,32,64); Conv3-shape(5,5,64,64)}
3 max pool 2x2 layers
1 fully connected layer.
I have tried using the NVIDIA-SMI tool, but it only shows me the GPU memory consumption while the model is running.
I would like to know if there is any method or way to estimate the memory requirement before running the model on the GPU, so that I can design models with the available memory in mind.
I have tried using this method for estimation, but my calculated memory and the observed memory utilisation are nowhere near each other.
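For reference, the kind of rough estimate I am attempting looks like the sketch below (my own assumptions: float32 values, stride-1 SAME convolutions, batch size 1, and the fully connected layer and any framework overhead are ignored):

BYTES = 4  # float32
H, W = 2456, 2058
convs = [(10, 10, 1, 32), (5, 5, 32, 64), (5, 5, 64, 64)]

params = 0
activations = H * W * 1                      # input image
for kh, kw, cin, cout in convs:
    params += kh * kw * cin * cout + cout    # conv weights + biases
    activations += H * W * cout              # conv output (SAME padding, stride 1)
    H, W = H // 2, W // 2                    # 2x2 max pool
    activations += H * W * cout              # pooled output

print("parameters  ~ %.1f MB" % (params * BYTES / 2**20))
print("activations ~ %.1f MB per image" % (activations * BYTES / 2**20))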
Thank you all for your time.
As far as I understand, when you open a session with tensorflow-gpu, it allocates all the memory on the GPUs that are available. So when you look at the nvidia-smi output, you will always see the same amount of used memory, even if TensorFlow actually uses only part of it. There are options when opening a session to force TensorFlow to allocate only a part of the available memory (see How to prevent tensorflow from allocating the totality of a GPU memory? for instance).
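A minimal sketch of one such option (TF 1.x API): allow_growth makes TensorFlow start with a small allocation and grow it on demand instead of grabbing the whole GPU up front.

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # allocate GPU memory only as it is actually needed
sess = tf.Session(config=config)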
You can control the GPU memory allocation in TensorFlow. Once you have calculated the memory requirements of your deep learning model, you can use tf.GPUOptions.
For example, if you want to allocate approximately 4 GB of GPU memory out of 8 GB:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)
Once done, pass it to tf.Session via the config parameter, as shown above.
The per_process_gpu_memory_fraction option bounds the fraction of GPU memory the process is allowed to use.
Here's the link to the documentation:
https://www.tensorflow.org/tutorials/using_gpu
NVIDIA-SMI ... shows me the GPU memory consumption as the model run
TF preallocates all available memory when you use it, so NVIDIA-SMI would show nearly 100% memory usage ...
but my calculated memory and observed memory utilisation are no where near to each other.
... so this is unsurprising.
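If you want to see what the model actually uses, rather than what TensorFlow pre-allocates, one option in TF 1.x is the contrib memory-stats ops. A sketch, assuming tf.contrib.memory_stats is available in your build:

import tensorflow as tf

with tf.device('/gpu:0'):
    bytes_in_use = tf.contrib.memory_stats.BytesInUse()     # bytes currently held by the GPU allocator
    max_bytes = tf.contrib.memory_stats.MaxBytesInUse()     # peak allocation so far

# build the rest of your model here ...

with tf.Session() as sess:
    # ... run your training step(s), then:
    print(sess.run([bytes_in_use, max_bytes]))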
Related
I'm working on feature generation before I train a model in PyTorch. I wish to save my features as PyTorch tensors on disk for later use in training.
One of my features ("Feature A") is calculated on a CPU, while another feature ("Feature B") must be calculated from it on a GPU (some linear algebra stuff). I have an unusual limitation: on my university cluster, jobs which don't use GPUs have CPU memory limits of 1 TB each, while jobs which do use GPUs have CPU memory limits of 4 GB with GPU memory limits of 48 GB. Feature A and Feature B are each approximately 10 GB.
Naturally, I want to first calculate Feature A using CPUs only and save it to disk. In another job (this one with GPU access and thus the 4 GB CPU memory limit), I want to load Feature A directly onto the GPU, compute Feature B, then save Feature B to disk.
With Feature A computed and saved to disk, I've tried:
feaB = torch.load(feaAfile, map_location=torch.device('cuda'))
And yet I max out my CPU memory. I've confirmed CUDA is available.
In the PyTorch documentation I see that in loading tensors they "are first deserialized on the CPU..."
I wonder if there is any way to avoid using CPU memory when I want to load only onto the GPU? If the tensor must first be copied to the CPU, could I use some sort of 4 GB buffer? Thanks so much in advance.
EDIT: per discussion in the comments, I no longer need to do this. But the question itself, of loading a tensor to the GPU without using CPU memory, remains unanswered so I'm leaving this question up.
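For what it's worth, the chunked-buffer idea mentioned above could look roughly like the sketch below. It assumes Feature A were saved as a raw float32 array (e.g. with numpy's tofile) rather than with torch.save, so that only one chunk is ever staged in CPU RAM; the function name and chunk size are hypothetical.

import numpy as np
import torch

def load_raw_to_gpu(path, shape, chunk_rows=100000):
    # Memory-map the file: nothing is read into RAM until a slice is touched.
    mm = np.memmap(path, dtype=np.float32, mode='r', shape=shape)
    out = torch.empty(shape, dtype=torch.float32, device='cuda')
    for start in range(0, shape[0], chunk_rows):
        end = min(start + chunk_rows, shape[0])
        chunk = np.ascontiguousarray(mm[start:end])    # only this chunk occupies CPU RAM
        out[start:end].copy_(torch.from_numpy(chunk))  # copy the chunk into the GPU tensor
    return out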
I can load the model and a data sample into GPU memory, but when I call forward on the model with the sample, it gives a CUDA out of memory error.
I'm sure the model and data have been loaded, as my code is structured as follows (pseudocode):
import time

model = Model().cuda()           # model parameters already on the GPU
sample = load_sample().cuda()    # single data sample already on the GPU
time.sleep(5)                    # pause to check memory usage with nvidia-smi
print('before forward')
model(sample)                    # the CUDA out of memory error is raised here
print('after forward')
"before forward" gets printed, but "after forward" does not.
I assumed all the necessary memory for a forward pass gets allocated during construction of the model, but I don't know how else this error could happen. I also cannot find anything about it on Google.
Python: 3.6.9
PyTorch: 1.2.0
It is not possible to determine the amount of space required to store the activations before runtime, which is why GPU memory usage increases. PyTorch maintains a dynamic computation graph, so the order of computations is not known before runtime. When you declare/initialize the model, only __init__ is called and the model parameters are initialized. To figure out the graph, one would need to look at the forward call, and possibly also at the loss function (if it is not inside the forward call).
Even if we could inspect the forward call before running the model, the batch size would still be unknown, so memory for the activations cannot be pre-allocated.
Even if the batch size is known, there can be other unknowns, such as sequence length (for RNNs) or episode length (in RL), that make it hard to pre-allocate memory for the activations. And even if we accounted for all of this at declaration time, PyTorch naturally allows for-loops in the forward pass, which makes it almost impossible to pre-allocate space for activations; hence GPU memory can grow during runtime depending on the use case.
As Umang Gupta pointed out in the comments, GPU memory will increase during a forward() call on a PyTorch model, as (possibly among other things) the batch size is not known before runtime. Therefore the required memory cannot be reserved beforehand, and GPU memory can still increase after the model and data have already been loaded.
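A small sketch that makes this visible (Model and load_sample are the hypothetical names from the question; torch.cuda.memory_allocated reports the bytes currently held by tensors):

import torch

model = Model().cuda()
sample = load_sample().cuda()
print(torch.cuda.memory_allocated())   # parameters + input only

out = model(sample)                    # activations are allocated here
print(torch.cuda.memory_allocated())   # noticeably larger after the forward pass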
I am currently trying to use the VGG16 model from the Keras library, but whenever I create an object of the VGG16 model by doing
from keras.applications.vgg16 import VGG16
model = VGG16()
I get the following message 3 times.
tensorflow/core/framework/allocator.cc:124 Allocation of 449576960 exceeds 10% of system memory
Following this, my computer freezes. I am using a 64-bit machine with 4 GB of RAM running Linux Mint 18, and I have no access to a GPU.
Does this problem have something to do with my RAM?
As a temporary solution, I am running my Python scripts from the command line, because my computer freezes less there than in any IDE. Also, this does not happen when I use an alternative model like InceptionV3.
I have tried the solution provided here, but it didn't work.
Any help is appreciated.
You are most likely running out of memory (RAM).
Try running top (or htop) in parallel and see your memory utilization.
In general, VGG models are rather big and require a decent amount of RAM. That said, the actual requirement depends on the batch size: a smaller batch means a smaller activation layer.
For example, a 6-image batch would consume about a gigabyte of RAM (reference). As a test you could lower your batch size to 1 and see if that fits in your memory.
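A minimal sketch of that test, assuming the stock keras.applications VGG16 with its default 224x224 input; the random array merely stands in for a real preprocessed image:

import numpy as np
from keras.applications.vgg16 import VGG16

model = VGG16()                                        # the weights alone are roughly 500 MB
x = np.random.rand(1, 224, 224, 3).astype('float32')   # a single-image batch
preds = model.predict(x, batch_size=1)                 # keeps activation memory minimal
print(preds.shape)                                     # (1, 1000)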
I am facing an issue with my Inception model during performance testing with Apache JMeter.
Error: OOM when allocating tensor with shape[800,1280,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: Cast = Cast[DstT=DT_FLOAT, SrcT=DT_UINT8, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
OOM stands for Out Of Memory. It means that your GPU has run out of space, presumably because you've allocated other tensors which are too large. You can fix this by making your model smaller or by reducing your batch size. By the looks of it, you're feeding in a large image (800x1280); you may want to consider downsampling it.
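A possible sketch of that downsampling in TF 1.x (the placeholder name and target size are illustrative; resizing halves each spatial dimension before the image ever reaches the rest of the graph):

import tensorflow as tf

frames = tf.placeholder(tf.uint8, shape=[None, 800, 1280, 3])   # original inputs
small = tf.image.resize_images(frames, [400, 640])              # returns float32 at half resolution
# feed `small` into the Inception graph instead of the full-resolution images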
If you have multiple GPUs at hand, select a GPU which is not as busy as this one (one possible reason for the OOM is that other processes are also running on this GPU). Go to the terminal and type
export CUDA_VISIBLE_DEVICES=1
where 1 is the index of the other available GPU, then re-run the same code.
You can check the available GPUs using
nvidia-smi
This will show you which GPUs are available and how much memory is free on each of them.
I am running a very large TensorFlow model on Google Cloud ML Engine.
When using the scale tier basic_gpu (with batch_size=1) I get errors like:
Resource exhausted: OOM when allocating tensor with shape[1,155,240,240,16]
because the model is too large to fit in one GPU.
Using the tier complex_model_m_gpu, which provides 4 GPUs, I can spread the operations across the 4 GPUs.
However, I remember reading that communication between GPUs is slow and can create a bottleneck in training. Is this true?
If so, is there a recommended way of spreading operations between the GPUs that prevents this problem?
I recommend the following guide:
Optimizing for GPU
From the guide:
The best approach to handling variable updates depends on the model,
hardware, and even how the hardware has been configured.
A few suggestions based on the guide:
Try using P100s, which have 16 GB of RAM (compared to 12 GB on the K80s). They are also significantly faster, although they cost more.
Place the variables on the CPU: tf.train.replica_device_setter(worker_device=worker, ps_device='/cpu:0', ps_tasks=1), as sketched below.
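A minimal sketch of that variable placement (TF 1.x; 'worker' here is a hypothetical device string such as '/gpu:0', and the variable and op shapes are only for illustration):

import tensorflow as tf

worker = '/gpu:0'   # hypothetical worker device for this tower
with tf.device(tf.train.replica_device_setter(worker_device=worker, ps_device='/cpu:0', ps_tasks=1)):
    w = tf.get_variable('w', shape=[1024, 1024])    # the variable is placed on the CPU
    y = tf.matmul(tf.random_normal([8, 1024]), w)   # the matmul runs on the worker GPU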
Using Tesla P100 GPUs instead of Tesla K80 GPUs fixes this issue because P100s have something called Page Migration Engine.
Page Migration Engine frees developers to focus more on tuning for
computing performance and less on managing data movement. Applications
can now scale beyond the GPU's physical memory size to virtually
limitless amount of memory.