Recently, I have been trying to reproduce a deep learning experiment from GitHub. However, every time I run it, I receive the following error:
2018-08-27 09:32:16.827025: E T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_dnn.cc:332] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
In this situation, I set up the TensorFlow session as follows.
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=False))
If I try to limit the GPU memory as follows, I find that I do not have enough memory to run my model.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
The information about my GPU is shown below. I am not sure where the problem is, and I have run into it several times. Thank you for your help!
2018-08-27 09:31:45.966248: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-08-27 09:31:46.199314: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.09GiB
According to the documentation, the error status CUDNN_STATUS_ALLOC_FAILED indicates a problem with host memory, not device memory. Check your RAM as well.
In my case, this was due to two TensorFlow processes using the GPU simultaneously (started either by you or by another user): https://stackoverflow.com/a/53707323/10993413
Source: https://forums.developer.nvidia.com/t/could-not-create-cudnn-handle-cudnn-status-alloc-failed/108261
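To check whether another process is already holding GPU memory, you can query NVML from Python. This is a minimal sketch, not from the original answers; it assumes the nvidia-ml-py3 package (which provides the pynvml module) is installed and that device index 0 is the GPU in question.
import pynvml  # installed via: pip install nvidia-ml-py3 (assumed to be available)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # device index 0 is an assumption

# List the compute processes currently holding memory on this GPU.
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    used_mb = proc.usedGpuMemory / (1024 ** 2) if proc.usedGpuMemory else 0
    print(f'pid {proc.pid} is using about {used_mb:.0f} MiB')

pynvml.nvmlShutdown()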
Related
I have a user with two GPUs; the first one is an AMD card, which can't run CUDA, and the second is a CUDA-capable NVIDIA GPU. I am using the code model.half().to("cuda:0"). I'm not sure whether the invocation successfully used the GPU, nor am I able to test it, because I don't have a spare computer with more than one GPU lying around.
In this case, does "cuda:0" mean the first device which can run CUDA, so it would've worked even if their first device was AMD? Or would I need to say "cuda:1" instead? How would I detect which number is the first CUDA-capable device?
The nvidia-ml-py3 package (Python bindings to NVML, the library behind nvidia-smi) can help you track GPU memory while your code runs.
To install it, run pip install nvidia-ml-py3. Take a look at this code snippet:
import nvidia_smi

cuda_idx = 0  # edit the device index that you want to track
to_cuda = f'cuda:{cuda_idx}'  # 'cuda:0' in this case

nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(cuda_idx)

def B2G(num):
    # convert bytes to GiB, rounded to 2 decimal places
    return round(num / (1024 ** 3), 2)

def print_memory(name, handle, pre_used):
    info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
    used = info.used
    print(f'{name}: {B2G(used)}')
    print(f'This step used: {B2G(used - pre_used)}')
    print('------------')
    return used

# start
mem = print_memory('Start', handle, 0)
model = ...  # init your model
model.to(to_cuda)
mem = print_memory('Init model', handle, mem)
The example above uses nvidia-ml-py3 to track how much memory each part of the model needs and prints it in GiB.
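One small addition of mine, not part of the original snippet: when you are finished tracking, you can release NVML cleanly.
# Release the NVML handle once you are done measuring.
nvidia_smi.nvmlShutdown()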
Edit: to check the list of GPUs:
import torch

def check_gpu():
    for i in range(torch.cuda.device_count()):
        device_name = f'cuda:{i}'
        print(f'{i} device name: {torch.cuda.get_device_name(torch.device(device_name))}')
I tested it, and as I suspected, model.half().to("cuda:0") puts your model on the first available GPU with CUDA support, i.e. the NVIDIA GPU in your case. The AMD GPU isn't visible as a CUDA device, so you can safely assume that cuda:0 refers to a CUDA-enabled GPU; the AMD GPU won't be seen by your program.
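If you want to double-check where the weights actually landed, a quick sanity check (my addition, not from the original answer; it assumes model is the module you moved) is to inspect a parameter's device:
import torch

# `model` is assumed to be the nn.Module you moved with model.half().to("cuda:0")
print(next(model.parameters()).device)  # expected: cuda:0
print(torch.cuda.is_available())        # True if any CUDA device is visible
print(torch.cuda.device_count())        # number of CUDA-capable GPUs PyTorch sees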
Have a good day.
There are plenty of methods in torch.cuda for querying and monitoring GPU devices.
For example, you can check the type of each device:
torch.cuda.get_device_name(torch.device('cuda:0'))
# or
torch.cuda.get_device_name(torch.device('cuda:1'))
In my case, get_device_name returns:
'Quadro RTX 6000'
If you want a more programmatic way to explore the properties of your devices, you can use torch.cuda.get_device_properties.
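For example, here is a short sketch of what get_device_properties exposes (the device index 0 is just an assumption):
import torch

props = torch.cuda.get_device_properties(torch.device('cuda:0'))
print(props.name)                        # e.g. 'Quadro RTX 6000'
print(props.total_memory / (1024 ** 3))  # total device memory in GiB
print(props.major, props.minor)          # CUDA compute capability
print(props.multi_processor_count)       # number of streaming multiprocessors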
Once you are working with a device (or believe you are), you can use torch.cuda's memory management functions to monitor GPU memory usage.
For instance, you can get a very detailed account of the current state of your device's memory using:
torch.cuda.memory_stats(torch.device('cuda:0'))
# or
torch.cuda.memory_stats(torch.device('cuda:1'))
If you want nvidia-smi-like stats on utilization, you can use torch.cuda.utilization.
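A brief sketch of how these monitoring calls fit together (device index 0 is an assumption, and torch.cuda.utilization requires pynvml to be installed):
import torch

dev = torch.device('cuda:0')  # assumption: the device you want to monitor
print(torch.cuda.memory_allocated(dev) / (1024 ** 2))  # MiB currently allocated by tensors
print(torch.cuda.memory_reserved(dev) / (1024 ** 2))   # MiB reserved by PyTorch's caching allocator
print(torch.cuda.utilization(dev))                     # GPU utilization in percent (needs pynvml)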
When I run my code, I get this message every time:
2018-09-27 19:31:03.353933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 650 Ti major: 3 minor: 0 memoryClockRate(GHz): 0.941
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.65GiB
2018-09-27 19:31:03.355743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-27 19:31:04.822514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-27 19:31:04.822895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-09-27 19:31:04.823072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2018-09-27 19:31:04.823679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1416 MB memory) -> physical GPU (device: 0, name: GeForce GTX 650 Ti, pci bus id: 0000:01:00.0, compute capability: 3.0)
2018-09-27 19:31:12.050251: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 261.79MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-09-27 19:31:17.191146: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
The last two messages, which are warnings, seem odd to me: I supposedly have 1.65GiB of free memory, yet these smaller amounts cannot be allocated. What could I do to fix it? What is the source of this message? And also: why can't I get more than 50% usage out of my GPU?
Here is what it looks like when I start training:
The code itself is in my repo (it's hard for me to know which parts of my code are relevant).
It looks like you're not using a multi-GPU model?
see for example https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/
Have you tried increasing the batch size? I see from your code that you use a batch size of 1.
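If you ever do have more than one GPU available, a minimal sketch of the Keras multi_gpu_model utility that the linked article covers would look like this; the toy model, gpus=2, and the training settings are my assumptions, not from the original post.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

# A toy model, just to make the sketch self-contained (not the asker's model).
model = Sequential([Dense(16, activation='relu', input_shape=(8,)), Dense(1)])

# Replicate the model on 2 GPUs (gpus=2 is an assumption).
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(loss='mse', optimizer='adam')

# Larger batch sizes tend to improve GPU utilization compared to batch_size=1.
x_train = np.random.random((256, 8))
y_train = np.random.random((256, 1))
parallel_model.fit(x_train, y_train, epochs=2, batch_size=64)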
I am building a Keras model to run a simple image recognition task. If I do everything in raw Keras, I don't hit OOM. Strangely, however, when I run it through a mini framework I wrote (fairly simple, mainly so that I can keep track of the hyperparameters and setup I used), I do hit OOM. Most of the execution should be the same as running raw Keras, so I am guessing I made a mistake somewhere in my code. Note that this same mini framework had no issue running on the CPU on my local laptop. I think I will need to debug, but before that, does anyone have any general advice?
Here are a few lines of the errors I got:
Epoch 1/50
2018-05-18 17:40:27.435366: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-05-18 17:40:27.435906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235 pciBusID: 0000:00:04.0 totalMemory: 11.17GiB freeMemory: 504.38MiB
2018-05-18 17:40:27.435992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-05-18 17:40:27.784517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-18 17:40:27.784675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-05-18 17:40:27.784724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-05-18 17:40:27.785072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 243 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
2018-05-18 17:40:38.569609: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 36.00MiB. Current allocation summary follows.
2018-05-18 17:40:38.569702: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (256): Total Chunks: 66, Chunks in use: 66. 16.5KiB allocated for chunks. 16.5KiB in use in bin. 2.3KiB client-requested in use in bin.
2018-05-18 17:40:38.569768: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (512): Total Chunks: 10, Chunks in use: 10. 5.0KiB allocated for chunks. 5.0KiB in use in bin. 5.0KiB client- etc. etc
2018-05-18 17:40:38.573706: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[18432,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
This is caused by running out of GPU memory, as is clear from the warnings.
The first workaround is to allow GPU memory to grow, if possible, by creating this ConfigProto and passing it to tf.Session():
# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
Then pass this config to the session that is causing the error, like this:
tf.Session(config = config)
If this doesn't help, you can disable the GPU for the particular session that is causing the error, like this:
config = tf.ConfigProto(device_count ={'GPU': 0})
sess = tf.Session(config=config)
If you are using Keras, you can apply these configs through the Keras backend by setting its session.
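A minimal sketch of how that looks with standalone Keras on the TensorFlow 1.x backend (the import path is the commonly used one; adjust it if your Keras version differs):
import tensorflow as tf
import keras.backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # same growth option as above

# Hand the configured session to Keras so all Keras ops run inside it.
K.set_session(tf.Session(config=config))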
I've run YOLO detection with a trained model using my GPU (an NVIDIA GTX 1060 3GB), and everything worked fine.
Now I am trying to train my own model, with the parameter --gpu 1.0. TensorFlow can see my GPU, as I can read in these messages at startup:
"name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705"
"totalMemory: 3.00GiB freeMemory: 2.43GiB"
Anyway, later on, when the program loads the data and tries to start learning, I get the following error:
"failed to allocate 832.51M (872952320 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY"
I've checked whether it tries to use my other GPU (Intel 630), but it doesn't.
When I run the training process without the --gpu option, it works fine, but slowly.
(I've also tried --gpu 0.8, 0.4, etc.)
Any idea how to fix it?
Problem solved. Changing the batch size and image size in the config file didn't seem to help, as they weren't loaded correctly. I had to go into the defaults.py file and lower them there to make it possible for my GPU to handle the training steps.
It looks like your custom model uses too much memory and your graphics card cannot support it. You only need to use the --batch option to control how much memory is used.
I have two computers with the same GPU (GTX 1080) and the same OS and software installed. But when I run my TensorFlow program (an RNN model), the speeds are very different: one is about 1.5x faster than the other.
Here are the key specs of the two:
SystemA: Asus Z170-P, i7 6700T, 32GB RAM, GTX 1080.
SystemB: Asus X99 E-WS, i7 5930K, 128GB RAM, GTX 1080. (the problem machine)
Both are installed with (using the same method):
OS: Ubuntu 16.04
GPU driver version: 378.13
Cuda version: 8.0
cuDNN version: 5.1
TensorFlow: installed via pip install tensorflow-gpu==1.0.1
Python: Anaconda 3.6
Sample code:
import tensorflow as tf
import numpy as np
from tqdm import trange
h,w = 3000, 2000
steps = 1000
x = tf.placeholder(dtype=tf.float32, shape=[h, w], name='x')
t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
m = tf.matmul(x,t)
x0 = np.random.random(size=[h, w])
sess = tf.Session()
for i in trange(steps):
    x0 = sess.run(m, feed_dict={x: x0})
SystemA performs 75 iter/sec while systemB only reaches 50 iter/sec; yes, the weaker machine is actually the faster one.
Key observations:
SystemB has far more page faults while running the program.
Monitoring Volatile GPU-Util in nvidia-smi, systemA sits steadily at about 40% while systemB is at about 30%.
Things I have tried on systemB:
Upgrade BIOS to the latest version and reset default settings.
Call Asus customer service for help.
Swap GPU card with system A.
Change the PCI-e slot to make sure it is running at x16 gen3.
Add LD_PRELOAD="/usr/lib/libtcmalloc.so" to the .bashrc file.
The main differences in the output of /usr/bin/time -v are:
# The first value is for systemB and the second is for systemA.
System time (seconds): 7.28 2.95
Percent of CPU this job got: 85% 106%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:22.41 0:14.89
Minor (reclaiming a frame) page faults: 684695 97853
Involuntary context switches: 164 91063
File system inputs: 0 24
File system outputs: 8 0
Can anybody point me to a direction of how to profile/debug this issue? Many thanks in advance!
There is a chance that you may not be using the GPU at all. To test this, use
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
to display what devices you are using.
If indeed you are running on the CPU, then you can add the following before your TensorFlow code:
with tf.device('/gpu:0'):  # NEW LINE
    x = tf.placeholder(dtype=tf.float32, shape=[h, w], name='x')
    t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
    m = tf.matmul(x, t)
If this isn't the case, add a comment with your results and I'll follow up to see what else I can do.
According to some sources, tf.constant is a GPU memory hog. Try replacing
t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
with
t = tf.Variable(np.random.random(size=[w, w]), dtype=tf.float32)
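One caveat with that swap (my addition, not part of the original suggestion): a tf.Variable must be initialized before the first sess.run, for example:
# After switching to tf.Variable, initialize it once before running the graph.
sess = tf.Session()
sess.run(tf.global_variables_initializer())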
Try a model without host-to-device data traffic:
import tensorflow as tf
import numpy as np
from tqdm import trange
h,w = 3000, 2000
steps = 1000
x = tf.random_normal( [h, w] , dtype=tf.float32 )
t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
m = tf.matmul(x,t)
s = tf.reduce_mean( tf.reduce_mean( m ) )
sess = tf.Session()
for i in range(steps):
    sess.run(s)
Results of Experimentation with Xer
After much discussion and troubleshooting, it became apparent that the two machines are indeed different. The NVIDIA cards were swapped, which resulted in no change. The machines have two different CPUs, one with integrated graphics and one without, and their clock speeds differ. I suggested that the machine whose i7 has onboard graphics have the OS's graphical windowing system disabled, to make sure the test was unused GPU vs. unused GPU. The problem persisted.
The original problem that was posted creates huge amounts of data traffic across the main bus from the CPU to the NVIDIA GPU, as can be seen here:
Tx Throughput : 75000 KB/s
Rx Throughput : 151000 KB/s
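For reference, here is a sketch of how those PCIe throughput counters can be read programmatically through NVML (it assumes the nvidia-ml-py3/pynvml bindings are installed; the figures above came from nvidia-smi, not from this snippet):
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # device index 0 is an assumption

# PCIe throughput counters, reported by NVML in KB/s.
tx = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_TX_BYTES)
rx = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_RX_BYTES)
print(f'Tx Throughput: {tx} KB/s')
print(f'Rx Throughput: {rx} KB/s')

pynvml.nvmlShutdown()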
We experimented with changing the size of the problem (w = 2000, 200, and 1000) and found that when w was small enough, the two machines performed nearly identically. Note that w controls not only the size of the problem on the GPU but also the amount of traffic between the CPU and the GPU.
Although we did not find a solution or an exact model, I believe that after much exploration with Xer I can say with confidence that the two systems are not the same, and their physical differences (bus + CPU) account for the performance difference.