When I run my code, I get this message every time:
2018-09-27 19:31:03.353933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 650 Ti major: 3 minor: 0 memoryClockRate(GHz): 0.941
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.65GiB
2018-09-27 19:31:03.355743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-27 19:31:04.822514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-27 19:31:04.822895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-09-27 19:31:04.823072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2018-09-27 19:31:04.823679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1416 MB memory) -> physical GPU (device: 0, name: GeForce GTX 650 Ti, pci bus id: 0000:01:00.0, compute capability: 3.0)
2018-09-27 19:31:12.050251: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 261.79MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-09-27 19:31:17.191146: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
The last two messages, which are warnings, seem odd to me: I supposedly have 1.65GiB of free memory, yet smaller amounts cannot be allocated. What is the source of these warnings, and what can I do to fix them? Also, why can't I get more than 50% usage out of my GPU?
Here is what it looks like when I start training:
The code itself is in my repo (it's hard for me to know which parts of my code are relevant).
Looks like you're not using a multi-gpu model?
See, for example, https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/
Have you tried increasing the batch size? I see from your code that you use a batch size of 1.
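A rough sketch of what those two suggestions look like together, assuming a TF 1.x-era Keras setup with more than one visible GPU; the placeholder model, dummy data, gpus=2, and batch_size=32 are illustrative assumptions, not values from your repo:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.utils import multi_gpu_model

# Placeholder model and data; substitute your own from the repo
model = Sequential([Flatten(input_shape=(64, 64, 3)), Dense(10, activation='softmax')])
x_train = np.random.rand(256, 64, 64, 3)
y_train = np.random.randint(0, 10, size=(256,))

# Replicate the model across 2 GPUs; Keras splits each batch between them
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# A batch size well above 1 also keeps each GPU busier
parallel_model.fit(x_train, y_train, batch_size=32, epochs=1)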
Today I upgraded my account to Colab Pro. Although it reports the RAM as:
Your runtime has 27.3 gigabytes of available RAM
You are using a high-RAM runtime!
when I start training my model, it gives the error below.
RuntimeError: CUDA out of memory. Tried to allocate 88.00 MiB (GPU 0; 15.90 GiB total capacity; 14.75 GiB already allocated; 75.75 MiB free; 14.95 GiB reserved in total by PyTorch)
Hyperparameters of my model:
args_dict = dict(
    # data_dir="",  # path for data files
    output_dir="",  # path to save the checkpoints
    model_name_or_path='t5-large',
    tokenizer_name_or_path='t5-large',
    max_seq_length=600,
    learning_rate=3e-4,
    weight_decay=0.0,
    adam_epsilon=1e-8,
    warmup_steps=0,
    train_batch_size=4,
    eval_batch_size=4,
    num_train_epochs=2,
    gradient_accumulation_steps=16,
    n_gpu=1,
    early_stop_callback=False,
    fp_16=True,  # if you want to enable 16-bit training then install apex and set this to true
    opt_level='O1',  # you can find out more on optimisation levels here https://nvidia.github.io/apex/amp.html#opt-levels-and-properties
    max_grad_norm=1.0,  # if you enable 16-bit training then set this to a sensible value, 0.5 is a good default
    seed=42,
)
Colab Pro does not seem to provide all the RAM. My code only works if train_batch_size = 1. What causes this? Any ideas?
Note: I get the same error when I run the code on Kaggle (16 GB). So what do I actually get with Colab Pro?
Looking at your error, the 16 GB refers to the graphics card's memory, not the system RAM.
As far as I know, Colab Pro gives you access to a graphics card with up to 16 GB of VRAM.
You can check the VRAM amount by running the following code.
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
    print('and then re-execute this cell.')
else:
    print(gpu_info)
Maybe try a batch size smaller than 4?
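One way to follow that advice without shrinking the effective batch is to lower train_batch_size and raise gradient_accumulation_steps in the args_dict above; this sketch assumes the training loop multiplies the two to get the effective batch size:
# A per-step batch of 1 fits in GPU memory; 1 * 64 = 64 keeps the same effective batch as 4 * 16
args_dict.update(
    train_batch_size=1,
    eval_batch_size=1,
    gradient_accumulation_steps=64,
)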
My Google Colab crashes as soon as it starts training on Tiny ImageNet, which has 0.1 million images and 200 classes at size 64*64.
The Colab log shows:
WARNING:root:kernel 1fe0be22-c98a-4519-a16a-69c9fb4be1da restarted
KernelRestarter: restarting kernel (1/5), keep random ports
tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10754 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
I am using model.fit_generator and have tried batch sizes from 32 to 1024 and image sizes from 16 to 64, but nothing works.
I tried a ResNet-18 architecture (1.8*10^9 params) as well as a custom model with 0.8 million params, but in vain.
I am pasting the link to my Colab in case anybody needs any other info:
https://colab.research.google.com/drive/1QG1mg1zOn6gZaaSv4rrI4F6erdxsxQ8V#scrollTo=Uy0M-VDHivOX
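For reference, the fit_generator setup described above would look roughly like this; the placeholder model, dataset path, and batch size are illustrative assumptions rather than values from the linked notebook. Streaming batches from disk this way also avoids holding all of Tiny ImageNet in host RAM at once, which is a common cause of Colab kernel restarts:
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

# Placeholder model; substitute the ResNet-18 or custom model from the notebook
model = Sequential([
    Conv2D(16, 3, activation='relu', input_shape=(64, 64, 3)),
    Flatten(),
    Dense(200, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Stream batches from disk instead of holding all ~100k images in host RAM
datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_directory(
    'tiny-imagenet-200/train',   # hypothetical dataset path
    target_size=(64, 64),
    batch_size=32,
    class_mode='categorical',
)
model.fit_generator(train_gen, steps_per_epoch=len(train_gen), epochs=10)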
Recently, I have been trying to reproduce a deep learning experiment from GitHub. However, every time I run the experiment, I receive the following error:
2018-08-27 09:32:16.827025: E T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
In this situation, I set up the TensorFlow session as follows:
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False))
If I try to limit the GPU memory as follows, I find that I do not have enough memory to run my model.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
The information about my GPU is below. I am not sure where the problem is, and I have run into this problem several times. Thank you for your help!
2018-08-27 09:31:45.966248: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-08-27 09:31:46.199314: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.09GiB
sean: According to the documentation, the error status CUDNN_STATUS_ALLOC_FAILED is due to a problem with the host memory, not the device memory. Check your RAM as well.
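If host memory is the suspect, a quick check right before the model is built can confirm it; this sketch uses psutil, which is an assumption on my part and not part of the original code:
import psutil

# Print how much host RAM is still available before TensorFlow/cuDNN initialize
print('Available host RAM: %.2f GiB' % (psutil.virtual_memory().available / 2**30))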
In my case, this was due to running 2 TensorFlow processes using the GPU simultaneously (either by you or by other users): https://stackoverflow.com/a/53707323/10993413
Source: https://forums.developer.nvidia.com/t/could-not-create-cudnn-handle-cudnn-status-alloc-failed/108261
I am building a Keras model to run a simple image recognition task. If I do everything in raw Keras, I don't hit OOM. Strangely, when I do it through a mini framework I wrote, which is fairly simple and mainly there so I can keep track of the hyperparameters and setup I used, I hit OOM. Most of the execution should be the same as running raw Keras, so I am guessing I made a mistake somewhere in my code. Note that this same mini framework had no issue running on CPU on my local laptop. I think I will need to debug, but before that, does anyone have any general advice?
Here are a few lines of the errors I got:
Epoch 1/50
2018-05-18 17:40:27.435366: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-05-18 17:40:27.435906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235 pciBusID: 0000:00:04.0 totalMemory: 11.17GiB freeMemory: 504.38MiB
2018-05-18 17:40:27.435992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-05-18 17:40:27.784517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-18 17:40:27.784675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-05-18 17:40:27.784724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-05-18 17:40:27.785072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 243 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
2018-05-18 17:40:38.569609: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 36.00MiB. Current allocation summary follows.
2018-05-18 17:40:38.569702: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (256): Total Chunks: 66, Chunks in use: 66. 16.5KiB allocated for chunks. 16.5KiB in use in bin. 2.3KiB client-requested in use in bin.
2018-05-18 17:40:38.569768: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (512): Total Chunks: 10, Chunks in use: 10. 5.0KiB allocated for chunks. 5.0KiB in use in bin. 5.0KiB client- etc. etc
2018-05-18 17:40:38.573706: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[18432,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
This is caused by running out of GPU memory, as is clear from the warnings.
The first workaround is to allow the GPU memory to grow when possible, by creating this ConfigProto and passing it to tf.Session():
# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
Then pass this config to the session that is causing the error, like:
tf.Session(config=config)
If this doesn't help, you could disable the GPU for the particular session that is causing the error, like this:
config = tf.ConfigProto(device_count={'GPU': 0})
sess = tf.Session(config=config)
If you are using Keras, you can apply these configs through the Keras backend by setting its session.
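For example, with the TF 1.x-style Keras backend that could look like the following sketch, applying the same allow_growth config from above:
import tensorflow as tf
from keras import backend as K

# Build the memory-growth config and hand it to Keras as its session
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))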
I've run YOLO detection with a trained model on my GPU (an Nvidia GTX 1060 3GB), and everything worked fine.
Now I am trying to train my own model with the --gpu 1.0 parameter. TensorFlow can see my GPU, as I can tell from these startup messages:
"name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705"
"totalMemory: 3.00GiB freeMemory: 2.43GiB"
Anyway, later on, when the program loads the data and tries to start learning, I get the following error:
"failed to allocate 832.51M (872952320 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY"
I've checked whether it tries to use my other GPU (Intel 630), but it doesn't.
When I run the training process without the --gpu option, it works fine, but slowly.
(I've also tried --gpu 0.8, 0.4, etc.)
Any idea how to fix it?
Problem solved. Changing the batch size and image size in the config file didn't seem to help, as they didn't load correctly. I had to go to the defaults.py file and lower them there to make it possible for my GPU to compute the steps.
Looks like your custom model uses too much memory and the graphics card cannot support it. You only need to use the --batch option to control the memory usage.
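For instance, with darkflow's flow command a training call with a smaller batch could look like this (the paths are placeholders; --batch and --gpu are the options discussed above, and the remaining flags are the usual darkflow training flags as far as I recall):
flow --model cfg/tiny-yolo-voc-custom.cfg --load bin/tiny-yolo-voc.weights --train --annotation data/annotations --dataset data/images --gpu 0.8 --batch 4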