Unable to set model to CUDA device - pytorch

I ran some code last week using some GPUs my lab has. When I ran it again today (exact same code), it no longer works. I checked with print statements and it makes it to the line model = model.to(device), where device = torch.device(f"cuda:{i}") for some index i. Once it reaches this line, it doesn't give any errors but just sits there indefinitely. Does anyone have any idea why this may be happening? Is it just an issue with our machines?
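For reference, a minimal sketch of how that line is usually set up (the GPU index and the tiny nn.Linear are placeholders, not the original model). If even the torch.cuda.is_available() call hangs, the problem is almost certainly the driver or GPU state on the machine rather than the code itself:

```python
import torch
import torch.nn as nn

i = 0  # hypothetical GPU index, stands in for whichever card you pick

# torch.device (not torch.cuda) is what builds the device handle
device = torch.device(f"cuda:{i}" if torch.cuda.is_available() else "cpu")

# Sanity checks: if either of these also hangs, the issue is the driver /
# GPU state on the machine rather than anything in the training code
print(torch.cuda.is_available())
if device.type == "cuda":
    print(torch.cuda.get_device_name(i))

model = nn.Linear(8, 2)    # tiny stand-in for the real model
model = model.to(device)   # the line that reportedly hangs
print(next(model.parameters()).device)
```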

Related

TrueNAS shell gives ~~ at an interval and eventually causes a crash

After an update of my TrueNAS I started getting some strange double beeps.
I thought it might be a thermal warning, so I cleaned my NAS PC, hooked up a monitor and keyboard, and booted it up.
I started to see strange, seemingly random escape sequences popping up: ^[[6~^ ^[[6~^.
I thought nothing of it.
Then more beeps, system froze. I checked the monitor. It was flooded with ^[[6~^ ^[[6~^.
I then rebooted my TrueNAS and went into the shell by pressing 9.
Now I see ~~, and the same beeps occur when the characters appear, roughly every 8 seconds.
What is causing these? I tried unplugging all USB devices, and I even tried to Google it.
I found things like kbdcontrol, jons, and crontab, but with my very limited Linux knowledge I could not make anything work.
Hoping someone can help me with this.
Something went wrong with the patching process (I guess)
How I fixed it:
In the GUI I went to System -> Boot
I reverted to the previous patch.
Rebooted the system.
Issue still occurred.
Then I went to Dashboard -> check for updates.
It then went on and installed the update (as before).
Now the issue is resolved.

Receiving error messages at random in Google Colab Pro - PyTorch

I am running code in Google Colab to train a neural network.
All my scripts have been working just fine, but starting this week, I have been receiving this error:
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
which seems to occur at random. Sometimes it occurs at the beginning of the run, even before epoch 1; other times at epoch 160 or 56 or so. Nonetheless, it always seems to point to this line: loss.backward().
I'm running the code over GPU and have the paid subscription to Colab Pro.
Has anybody faced this issue? I read somewhere that this can be a problem of the GPU running out of memory; however, I can't say that for sure given the error messages I'm receiving.
Well, it took a while, but I managed to find the source of this problem myself. Some other posts mentioned this could be a GPU memory issue, so I tried to minimize memory usage as much as possible. Though this was good for my code, it didn't solve the problem.
Others talked about switching to CPU and running the script to get a better error message (which I did, and it took forever). Running my script on CPU gave the error that binary cross-entropy was not receiving inputs in the zero-to-one interval. This was clearly not the problem, since those inputs came from a sigmoid function.
Finally, I recalled the last thing I changed before my script started behaving like this, and it turned out to be the learning rate. When I ran my training with a learning rate of 0.001, everything was fine. I switched it to 0.02 (20 times higher) and then started receiving these execution errors at random. Switching back to the smaller learning rate solved the problem immediately. No more GPU errors, and now I'm happy.
So, if you have this issue, you might take a look at the learning rate; hopefully that helps.
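If anyone wants to check for the same failure mode, here is a rough sketch under the same assumptions (the toy model, data, and BCE loss are placeholders; only the learning-rate values come from the answer above). The idea is that a too-large learning rate can drive the loss to inf/NaN, which on the GPU may surface as an opaque cuDNN execution failure inside loss.backward():

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model and data
model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # 0.02 triggered the errors

x = torch.randn(32, 10)
y = torch.randint(0, 2, (32, 1)).float()

for epoch in range(200):
    optimizer.zero_grad()
    out = model(x)
    loss = criterion(out, y)

    # Catch divergence on the host side before it turns into a vague GPU error
    if not torch.isfinite(loss):
        raise RuntimeError(f"loss diverged at epoch {epoch}: {loss.item()}")

    loss.backward()
    optimizer.step()
```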

FAISS search fails with vague error: "Illegal instruction" or kernel crash

I am currently trying to run a basic similarity search via FAISS, using the reproducible code from that link. However, every time I run the code in the following environments, I hit these problems:
Jupyter notebook - kernel crashes
VS Code - I receive an "Illegal instruction" message in the terminal with no further detail
I've got similar code working in Kaggle, so I suppose the problem is with my particular setup.
Based on my print statements, it appears that the error occurs during the call to the .search method (a minimal stand-in for the failing call is sketched after the setup list below). Because the error is so vague, I've not been able to find much information on the problem. Some people mentioned that older processors may have a problem (AVX/AVX2 flags being the culprit?), though admittedly I didn't quite understand the connection.
Problem: Can I get some help understanding this error and, if possible, a potential solution?
Current setup:
WSL2
VS Code (v. 1.49.0)
Jupyter-client (v. 6.1.7)
Jupyter-core (v. 4.6.3)
FAISS-cpu (v. 1.6.3)
Numpy (v. 1.19.2)
Older machine (AMD FX-8350 with 16GB RAM)
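For context, the failing call boils down to something like the sketch below (dimensions and data are arbitrary stand-ins, not the code from the link). On an affected CPU the .search line is where the kernel crash / "Illegal instruction" shows up:

```python
import numpy as np
import faiss  # faiss-cpu 1.6.3 in the setup above

d = 64                                              # arbitrary vector dimension
xb = np.random.random((1000, d)).astype("float32")  # database vectors
xq = np.random.random((5, d)).astype("float32")     # query vectors

index = faiss.IndexFlatL2(d)  # exact L2 index, no training step needed
index.add(xb)

D, I = index.search(xq, 4)    # the .search call that triggers the crash
print(I[:2])
```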
For anyone who runs across this error: the problem (in my case) was that my CPU was old enough that it does not support AVX2. To determine this, I used this SO post.
Once I ran the code in Colab or on a newer machine, all was well.
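One rough way to check this on Linux/WSL2 (assuming /proc/cpuinfo is available; the FX-8350 should report avx but not avx2):

```python
# Look for the AVX/AVX2 flags that the prebuilt FAISS kernels may rely on
with open("/proc/cpuinfo") as f:
    flags = next(line for line in f if line.startswith("flags")).split()

for isa in ("avx", "avx2"):
    print(f"{isa}: {'yes' if isa in flags else 'no'}")
```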

Code failing to run the second time, but runs after the __pycache__ folder is deleted?

I am getting an overflow error immediately when running the exact same code a second time (the first run works as expected). However, deleting the associated __pycache__ folder seems to reset everything, and I can run the code once more.
What is even more confusing is that I do not encounter this problem at all when running the exact same code on a different computer. The only difference is that one computer is running Windows 7 and the other Windows 10. I'm also not sure if one is running IDLE 3.6 vs. 3.7, or whether that is even relevant.
Edit:
Warning (from warnings module):
File "C:\Users\...\AppData\Local\Programs\Python\Python37\Projects\New folder\double_pendulum.py", line 42
a = (-1/2)*(f1(th1,th2,p1,p2)*f2(th1,th2,p1,p2)*m.sin(th1-th2)+3*g*m.sin(th1))
RuntimeWarning: overflow encountered in double_scalars
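For anyone trying to narrow this down, one way to get more information is to promote the RuntimeWarning to an exception, which produces a full traceback at the first overflow instead of a one-line warning. The snippet below only demonstrates the mechanism on a plain float64 overflow, not the pendulum code itself:

```python
import warnings
import numpy as np

# Turn the overflow RuntimeWarning into an exception so the first offending
# operation raises with a full traceback instead of printing a warning
warnings.simplefilter("error", RuntimeWarning)

a = np.float64(1e308)
try:
    b = a * 10  # overflows a double; would normally only warn
except RuntimeWarning as exc:
    print("caught:", exc)
```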

Netlogo 5.1 (and 5.05) Behavior Space Memory Leak

I have posted on this before, but thought I had tracked it down to the NW extension; however, the memory leak still occurs in the latest version. I found this thread, which discusses a similar issue but attributes it to BehaviorSpace:
http://netlogo-users.18673.x6.nabble.com/Behaviorspace-Memory-Leak-td5003468.html
I have found the same symptoms. My model starts out at around 650 MB, but over each run the private working set memory rises to the point where it hits the 1024 MB limit. I have sufficient memory to raise this, but in reality that would only delay the onset. I am using the table output, which based on previous discussions should help, and it does, but it only slows the rate of increase. Eventually the memory usage rises to the point where the PC starts to struggle. I am clearing all data between runs, so there should be no carry-over. I noticed in the highlighted thread that they were going to run headless; I will try this, but I wondered if anyone else had noticed the issue. My other option is to break the BehaviorSpace simulation into a few batches so the issue never arises, but it would be nice to let the model run and walk away, as it takes around 2 hours to go through.
Some possible next steps:
1) Isolate the exact conditions under which the problem does or does not occur. Can you make it happen without involving the nw extension, or not? Does it still happen if you remove some of the code from your model? What if you keep removing code: when does the problem go away? What is the smallest amount of code that still causes the problem? Almost any bug can be demonstrated with only a small amount of code, and finding that smallest demonstration is exactly what is needed in order to track down the cause and fix it.
2) Use standard memory profiling tools for the JVM to see what kind of objects are using the memory. This might provide some clues to possible causes.
In general, we are not receiving other bug reports from users along these lines. It is routine, and has been for many years now, for people to use BehaviorSpace (both headless and not) to run experiments that last for hours or even days. So whatever you're experiencing almost certainly has a more specific cause -- most likely in the nw extension -- that could be isolated.
