I am running code in Google Colab to train a neural network.
All my scripts have been working just fine, but starting this week, I have been receiving this error:
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
which seems to occur at random. Sometimes it happens at the very beginning of a run, even before epoch 1; other times at epoch 56 or 160 or so. Nonetheless, it always seems to point to this line: loss.backward().
I'm running the code on a GPU and have a paid Colab Pro subscription.
Has anybody faced this issue? I read somewhere that this can be a symptom of the GPU running out of memory, but I can't say that for sure given the error messages I'm receiving.
Well, it took a while, but I managed to find the source of this problem myself. Some other posts mentioned this could be a GPU memory issue, so I tried to minimize memory usage as much as possible. Though this was good for my code, it didn't solve the problem.
Others suggested switching to the CPU and running the script there to get a better error message (which I did, and it took forever). Running my script on the CPU gave an error about binary cross-entropy not receiving inputs in the zero-to-one interval. This was clearly not the real problem, since those inputs come from a sigmoid function.
Finally, I recalled the last thing I had changed before my script started behaving like this, and it turned out to be the learning rate. When I ran my training with a learning rate of 0.001, everything was fine. I switched it to 0.02 (20 times higher), and then I started receiving these execution errors at random. Switching back to the smaller learning rate solved the problem immediately. No more GPU errors, and now I'm happy.
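For reference, the only change was the lr argument of the optimizer. Here is a stripped-down sketch of the training loop (the tiny model, data, and Adam optimizer below are stand-ins, not my actual code; only the learning-rate value is the point):

    import torch
    import torch.nn as nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Stand-in model whose output passes through a sigmoid, as in my real network
    model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid()).to(device)
    criterion = nn.BCELoss()

    # lr=0.02 gave me the random cuDNN execution errors; lr=0.001 does not
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    inputs = torch.randn(32, 10, device=device)
    targets = torch.rand(32, 1, device=device)

    for _ in range(100):
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()   # the line the cuDNN error always pointed to
        optimizer.step()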
So, if you have this issue, you may want to take a look at your learning rate; hopefully that helps.
I ran some code last week using GPUs my lab has. When I ran it again today (the exact same code), it doesn't seem to work. I checked with print statements, and it makes it to the line model = model.to(device), where device = torch.device(f"cuda:{i}") for some index i. Once it hits this line, it doesn't give any errors; it just sits there and does nothing indefinitely. Does anyone have any idea why this may be happening? Is it just an issue with our machines?
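For context, the relevant lines look roughly like this (the model below is just a placeholder for mine):

    import torch
    import torch.nn as nn

    i = 0                                  # whichever GPU index I pass in
    device = torch.device(f"cuda:{i}")
    print(torch.cuda.is_available())       # prints True on our machines

    model = nn.Linear(128, 10)
    print("before .to(device)")            # this print is reached
    model = model.to(device)               # hangs here indefinitely, no error
    print("after .to(device)")             # never printed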
I'm currently trying to run a basic similarity search with FAISS, using the reproducible code from that link. However, every time I run the code, I hit the following problems, depending on where I run it:
Jupyter notebook - the kernel crashes
VS Code - I get an "Illegal instruction" message in the terminal with no further detail
I've got similar code working in Kaggle, so I suppose the problem is with my particular setup.
Based on print statements, it appears that the error occurs during the call to the .search method. Because the error is so vague, I haven't been able to find much information on the problem. Some people mentioned that older processors may have an issue (AVX/AVX2 flags being the culprit?), though admittedly I didn't quite understand the connection.
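For reference, what I'm running is essentially the standard flat-index quick start. A stripped-down version (not the exact code from the link; the sizes here are made up) that shows where it dies for me:

    import numpy as np
    import faiss

    d, nb, nq = 64, 1000, 5
    rng = np.random.default_rng(0)
    xb = rng.random((nb, d)).astype("float32")   # database vectors
    xq = rng.random((nq, d)).astype("float32")   # query vectors

    index = faiss.IndexFlatL2(d)                 # exact L2 search
    index.add(xb)
    distances, ids = index.search(xq, 4)         # crashes here (kernel dies / "Illegal instruction")
    print(ids)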
Problem: Can I get some help understanding this error, and if possible, a potential solution?
Current setup:
WSL2
VS Code (v. 1.49.0)
jupyter-client (v. 6.1.7)
jupyter-core (v. 4.6.3)
faiss-cpu (v. 1.6.3)
NumPy (v. 1.19.2)
Older machine (AMD FX-8350 with 16 GB RAM)
For anyone who runs across this error: the problem (in my case) was that my CPU is old enough that it doesn't support AVX2. To determine this, I used this SO post.
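If you want a quick way to check this yourself on Linux/WSL2, something along these lines works (my own sketch, not the code from the SO post; it just reads /proc/cpuinfo):

    # On my FX-8350, "avx" is listed but "avx2" is not
    flags = set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
                break

    print("avx  supported:", "avx" in flags)
    print("avx2 supported:", "avx2" in flags)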
Once I ran the code in Colab or on a newer machine, all was well.
I am training a network on a GPU with PyTorch. However, after at most 3 epochs, the code stops with the message:
Killed
No other error message is given.
I monitored the memory and GPU usage; there was still space during the run. I reviewed /var/log/dmesg for a detailed message about this, but no message containing "kill" appeared. What might be the problem?
CUDA version: 9.0
PyTorch version: 1.1.0
If you have root access, you can check whether or not this is a memory issue with the dmesg command.
In my case, the process was killed by the kernel because it ran out of memory.
I found the cause to be that I was saving tensors that require grad to a list, and each of those keeps an entire computation graph alive, which consumes significant memory.
I fixed the issue by appending the .detach()-ed tensor to the list instead of the tensor returned by the loss function.
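A minimal sketch of the pattern and the fix (the model, data, and optimizer here are placeholders; the point is what gets appended to the list that lives across iterations):

    import torch
    import torch.nn as nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Linear(20, 1).to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    losses = []
    for _ in range(1000):
        x = torch.randn(64, 20, device=device)
        y = torch.randn(64, 1, device=device)

        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

        # losses.append(loss)          # keeps each iteration's graph alive -> memory keeps growing
        losses.append(loss.detach())   # stores only the value; the graph can be freed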
You can type "dmesg" in your terminal and scroll down to the bottom. It will show you the message explaining why the process was killed.
Since you mentioned PyTorch, chances are that your process is being killed because it ran out of memory. To resolve this, reduce your batch size until you no longer see the error.
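For example, if you load data with a DataLoader, the batch size is just an argument you can lower step by step (the dataset below is a made-up placeholder):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(10000, 20), torch.randn(10000, 1))
    loader = DataLoader(dataset, batch_size=64, shuffle=True)   # try 64 -> 32 -> 16 ...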
Hope this helps! :)
To give an idea to anyone else who encounters this:
Apparently, Slurm was installed on the machine, so I needed to submit my jobs through Slurm instead of running them directly.
Whatever program I run on the GPU, even programs that ran successfully before, throws this error: CL_OUT_OF_RESOURCES from the clEnqueueReadBuffer function.
Then I remembered that I ran a deep learning framework last night, which crashed and may have eaten up all the memory on the GPU. I tried restarting the computer, but that didn't help.
Is it possible that my GPU ran out of memory due to the DL framework's crash?
If so, how should I solve this problem?
CL_OUT_OF_RESOURCES is a generic error given by the NVIDIA implementation at clEnqueueReadBuffer; it more or less means:
Something went out of bounds (resources) when trying to write to this buffer
Most probably, the kernel you launched beforehand, which writes to that buffer, went out of the buffer's bounds.
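To illustrate with a small PyOpenCL sketch (my own example, not your code): if the global work size gets rounded up past the buffer length and the kernel has no bounds check, the extra work-items write past the end of the buffer, and the failure then surfaces at the following clEnqueueReadBuffer (enqueue_copy) call. The guard in the kernel below is the fix:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    n = 1000
    out = np.zeros(n, dtype=np.float32)
    out_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, out.nbytes)

    src = """
    __kernel void fill(__global float *out, const int n) {
        int gid = get_global_id(0);
        if (gid < n)                 // without this guard, the padded work-items
            out[gid] = 2.0f * gid;   // would write past the end of the buffer
    }
    """
    prg = cl.Program(ctx, src).build()

    # Global size rounded up to a multiple of the work-group size,
    # so some work-items get a gid >= n
    local_size = 256
    global_size = ((n + local_size - 1) // local_size) * local_size
    prg.fill(queue, (global_size,), (local_size,), out_buf, np.int32(n))

    cl.enqueue_copy(queue, out, out_buf)   # the clEnqueueReadBuffer call
    print(out[:5])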
I have posted on this before and thought I had tracked it down to the nw extension; however, the memory leak still occurs in the latest version. I found this thread, which discusses a similar issue but attributes it to BehaviorSpace:
http://netlogo-users.18673.x6.nabble.com/Behaviorspace-Memory-Leak-td5003468.html
I have found the same symptoms. My model starts out at around 650 MB, but over each run the private working set memory rises, to the point where it hits the 1024 MB limit. I have sufficient memory to raise this limit, but in reality that would only delay the onset. I am using the table output, which, based on previous discussions, helps; it does, but it only slows the rate of increase. Eventually the memory usage rises to a point where the PC starts to struggle. I am clearing all data between runs, so there should be no hangover.

I noticed in the highlighted thread that they were going to run headless. I will try this, but I wondered if anyone else had noticed the issue? My other option is to break the BehaviorSpace experiment into a few batches so the issue never arises, but it would be nice to let the model run and walk away, as it takes around 2 hours to complete.
Some possible next steps:
1) Isolate the exact conditions under which the problem does or does not occur. Can you make it happen without involving the nw extension, or not? Does it still happen if you remove some of the code from your model? What if you keep removing code: when does the problem go away? What is the smallest amount of code that still causes the problem? Almost any bug can be demonstrated with only a small amount of code, and finding that smallest demonstration is exactly what is needed in order to track down the cause and fix it.
2) Use standard memory profiling tools for the JVM to see what kind of objects are using the memory. This might provide some clues to possible causes.
In general, we are not receiving other bug reports from users along these lines. It has been routine for many years now for people to use BehaviorSpace (both headless and not) to run experiments that last for hours or even days. So whatever you're experiencing almost certainly has a more specific cause, most likely in the nw extension, that could be isolated.