Optuna - Memory Issues - memory-leaks

I am trying to free memory in between Optuna optimization runs. I am using Python 3.8 and the latest version of Optuna. I call optuna.create_study(), then study.optimize(...) in a loop, with a new objective function each time. When I monitor my memory usage, every time optuna.create_study() is called the memory usage keeps increasing, to the point that the OS eventually kills the program. To give a clearer picture, the first run takes over 3% of memory and it eventually builds up to >80%. Any thoughts on how I can remove a study from memory in between successive calls of create_study()?

I had a similar problem working with PyTorch. Following the Optuna FAQ (https://optuna.readthedocs.io/en/stable/faq.html#how-do-i-avoid-running-out-of-memory-oom-when-optimizing-studies), I tried this solution:
study.optimize(objective, n_trials=n_trials, gc_after_trial=True)
which should be similar to
import gc
study.optimize(objective, n_trials=n_trials, callbacks=[lambda study, trial: gc.collect()])
However, neither of them worked for me. The only way I could fix it was by upgrading PyTorch to the newest version. I do not know whether you are using PyTorch or other ML packages, but you may need to update the appropriate packages in case the lines above do not work.
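If the growth really comes from the old studies piling up in memory, something along these lines is what I would try between runs (a rough sketch; make_objective, the trial counts, and the toy objective are placeholders, not from your code):

import gc
import optuna

def make_objective(i):
    # placeholder factory standing in for "a new objective function each time"
    def objective(trial):
        x = trial.suggest_float("x", -10, 10)
        return (x - i) ** 2
    return objective

for i in range(10):
    study = optuna.create_study()
    study.optimize(make_objective(i), n_trials=50, gc_after_trial=True)
    best = study.best_value   # keep only the numbers you need
    del study                 # drop the in-memory study (and all of its trials)
    gc.collect()              # force a collection before the next create_study()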

I had a similar problem when running my trials (although I did not loop over multiple optimizations). Creating a callback that collects garbage at the end of each epoch solved my problem and will probably already help you free quite some space. Try the following:
import tensorflow as tf
import gc

class MyCustomCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        gc.collect()

...

_ = model.fit(x_train, y_train, ...,
              callbacks=[MyCustomCallback()])
Solution based on this issue.

Related

Receiving error messages at random in Google Colab Pro - PyTorch

I am running a code in Google Colab for training a neural network.
All my scripts have been working just fine, but starting this week, I have been receiving this error:
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
which seems to occur at random. Sometimes it occurs at the beginning of my script run, say, even before epoch 1, other times at epoch 160 or 56 or so. Nonetheless, it always seems to point to this line: loss.backward().
I'm running the code on GPU and have the paid Colab Pro subscription.
Has anybody faced this issue? I read somewhere that this seems to be a problem of the GPU running out of memory; however, I can't say that for sure given the error messages I'm receiving.
Well, it took a while, but I managed to find the source of this problem myself. Some other posts mentioned this could be a GPU memory issue, so I tried to minimize memory usage as much as possible. Though this was good for my code, it didn't solve the problem.
Others talked about switching to CPU and running the script to get a better error message (which I did, and it took forever). Running my script on CPU gave an error about binary cross entropy not receiving inputs in the zero-to-one interval. This was clearly not the problem, since those inputs came from a sigmoid function.
Finally, I recalled the last thing I changed before my script started behaving like this, and it turned out to be the learning rate. When I ran my training with a learning rate of 0.001, everything was fine. I switched it to 0.02 (20 times higher) and then started receiving these execution errors at random. Switching back to the smaller learning rate solved the problem immediately. No more GPU errors, and now I'm happy.
So, if you have this issue, you may want to take a look at the learning rate, and hopefully this will help you.
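Just to make the change concrete, this is essentially all it amounted to (a minimal sketch; the SGD optimizer and the tiny model are placeholders, only the lr value is the point):

import torch

model = torch.nn.Linear(10, 1)  # placeholder for the real network

# lr=0.02 produced the random CUDNN_STATUS_EXECUTION_FAILED errors for me;
# going back to lr=0.001 made them disappear
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)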

PyTorch code stops with message "Killed". What killed it?

I train a network on a GPU with PyTorch. However, after at most 3 epochs, the code stops with the message:
Killed
No other error message is given.
I monitored the memory and GPU usage; there was still space during the run. I reviewed /var/sys/dmesg to find a detailed message about this, but no message containing "kill" was logged. What might be the problem?
CUDA version: 9.0
PyTorch version: 1.1.0
If you have root access, you can check whether this is a memory issue or not with the dmesg command.
In my case, the process was killed by the kernel due to running out of memory.
I found the cause to be saving tensors that require grad to a list; each of those keeps an entire computation graph alive, which consumes significant memory.
I fixed the issue by saving the .detach()-ed tensor to the list instead of the tensor returned by the loss function.
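For illustration, a minimal sketch of that fix (the tiny model, loss, and data here are placeholders, not the original training code):

import torch

model = torch.nn.Linear(10, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

losses = []
for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # appending `loss` directly would keep its whole computation graph alive;
    # .detach() (or .item()) stores just the value
    losses.append(loss.detach())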
You can type "dmesg" on your terminal and scroll down to the bottom. It will show you the message of why it is killed.
Since you mentioned PyTorch, the chances are that your process is killed due to "Out of Memory". To resolve this, reduce your batch size till you no longer see the error.
Hope this helps! :)
To give an idea to people who will encounter this:
Apparently, Slurm was installed on the machine, so I needed to submit my jobs through Slurm.

Anaconda Kernel and Google Colab crash when using cv2.FastFeatureDetector()

I am trying to use cv2.FastFeatureDetector(), and every time I run this code to extract features, my kernel crashes in both Google Colab and Anaconda for some reason. Initially I thought it was a memory management issue with my system, but the same thing happens in Colab.
import cv2
import numpy as np
image=cv2.imread('tree.jpg',0)
fast=cv2.FastFeatureDetector()
keypoints=fast.detect(image,None)
#After running this code my kernel crashes
There is no error message due to the kernel crash.
The image is fairly small in size and not that computationally expensive.
Here is the image:
https://www.setaswall.com/wp-content/uploads/2017/06/Sun-Tree-Branches-1920-x-1080.jpg
I had the same problem. With newer OpenCV versions you have to create your detector via fast = cv2.FastFeatureDetector_create(). Note that you might have to adjust the rest of your code due to other API changes.
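For reference, here is the question's snippet adapted to that newer API (a sketch assuming a current OpenCV build; 'tree.jpg' is the original placeholder path):

import cv2

image = cv2.imread('tree.jpg', 0)           # 0 = load as grayscale
fast = cv2.FastFeatureDetector_create()     # replaces cv2.FastFeatureDetector()
keypoints = fast.detect(image, None)
print(len(keypoints))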

Multiple python calls from bash but no speed-up

I want to run a Python3 process multiple times with different hyperparameters. To fully utilize the available CPUs, I want to spawn the process multiple times. However, I hardly observe any speed-up in practice. Below I reproduce a small test that illustrates the effect.
First a Python test script:
(speed_test.py)
import numpy as np
import time
now = time.time()
for i in range(50):
    np.matmul(np.random.rand(1000, 1000), np.random.rand(1000, 1000))
print(round(time.time() - now, 1))
A single call: python3 speed_test.py prints 10.0 seconds.
However, when I try to run 2 processes in parallel:
python3 speed_test.py & python3 speed_test.py & wait prints 18.6 18.9.
parallel python3 speed_test.py ::: {1..2} prints 18.3 18.7.
It seems as if parallelization hardly buys me anything here (two executions in almost twice the time). I know I can't expect a linear speed-up, but this seems like very little difference. My system has 1 socket with 2 cores per socket and 2 threads per core (4 CPUs in total). I see the same effect on an 8-CPU Google Cloud instance. Roughly, the computation time improves by no more than ~10-20% per process when running in parallel.
Finally, pinning CPUs to processes does not help much either:
taskset -c 0-1 python3 speed_test.py & taskset -c 2-3 python3 speed_test.py & wait prints 17.1 17.8
I thought each Python process could only utilize 1 CPU due to the Global Interpreter Lock. Is there any way to speed up my code?
Thanks for the reply @TomFenech, I should indeed have added the CPU usage information:
Local (4 vCPU): Single call = ~390%, double call ~190-200% each
Google cluster (8 vCPUs): single call ~400%, double call ~400% each (as expected)
Conclusion of the toy example: You are right. When I run htop, I actually see 4 processes per started job, not 1. So the job is internally distributing itself. I think this is related to the fact that the (matrix) multiplication is distributed by BLAS/MKL.
Continuation for the real job: So, the toy example above was actually more involved than intended and not a perfect proxy for my true script. My true (machine learning) script only partially relies on NumPy (not for matrix multiplication); most of the heavy computation is performed in PyTorch. When I run my script locally (4 vCPUs), it uses ~220% CPU. When I run that script on the Google Cloud cluster (8 vCPUs), it, surprisingly, even gets up to ~700% (htop indeed shows 7-8 processes). So PyTorch seems to be doing an even better job at distributing itself.
(The NumPy BLAS version can be retrieved with np.__config__.show(). My local NumPy uses OpenBLAS; the Google cluster uses MKL (Conda installation). I can't find a similar command to check the BLAS version of PyTorch, but I assume it uses the same.)
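(As a side note, and this may depend on the PyTorch version, the following should show both sides of this; torch.__config__.show() returns the build configuration string, including the BLAS/MKL backend:)

import numpy as np
import torch

np.__config__.show()             # which BLAS NumPy was built against
print(torch.__config__.show())   # PyTorch build info, including the BLAS/MKL backend
print(torch.get_num_threads())   # threads PyTorch uses for intra-op parallelism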
In general, the conclusion seems to be that both NumPy and PyTorch already take care of distributing the work when it comes to matrix multiplication (as long as all CPUs are locally visible, i.e. no cluster/server setting). Therefore, if most of your script is matrix multiplication, there is less reason than (at least I) expected to distribute scripts yourself.
However, not all of my code is matrix multiplication. Therefore, in theory I should still be able to get a speed-up from parallel processes. I wrote a new test, with 50/50 linear and matrix multiplication code:
(speed_test2.py)
import time
import torch
import random

now = time.time()
for i in range(12000):
    [random.random() for k in range(10000)]
print('Linear time', round(time.time() - now, 1))

now = time.time()
for j in range(350):
    torch.matmul(torch.rand(1000, 1000), torch.rand(1000, 1000))
print('Matrix time', round(time.time() - now, 1))
Running this on Google Cloud (8 vCPU):
Single process gives Linear time 12.6, Matrix time 9.2. (CPU during first part 100%, second part 500%)
Parallel process python3 speed_test2.py & python3 speed_test2.py gives Linear time 12.6, Matrix time 15.4 for both processes.
Adding a third process gives Linear time ~12.7, Matrix time 25.2
Conclusion: Although there are 8 vCPUs here, the PyTorch/matrix (second) part of the code actually gets slower with more than 2 processes. The linear part of the code does of course keep scaling (up to 8 parallel processes). I think this altogether explains why, in practice, NumPy/PyTorch code may not show that much improvement when you start multiple concurrent processes, and why it may not always be beneficial to naively start 8 processes when you see 8 vCPUs. Please correct me if I am wrong somewhere here.
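A possible follow-up that I have not benchmarked above (so treat it as an assumption): if you do want to run many of these processes side by side, it may help to stop each one from spawning a full set of BLAS/PyTorch threads and instead start as many single-threaded processes as you have cores. A hypothetical variant of speed_test2.py could look like this:

# speed_test2_single.py (hypothetical): same as speed_test2.py, but each
# process restricts itself to one thread for the matrix part
import os
os.environ.setdefault("OMP_NUM_THREADS", "1")   # set before importing numpy/torch
import torch
torch.set_num_threads(1)                        # one intra-op thread per process
# ... the rest of speed_test2.py unchanged ...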

Memory LEAK in matplotlib imshow

I have identified a memory leak in matplotlib.imshow. I am aware of similar questions (like Excessive memory usage in Matplotlib imshow) and I've read the related IPython thread (https://github.com/ipython/ipython/issues/1623/).
I believe that the code below should (in the absence of a memory leak) consume a constant amount of memory while running. Instead, it grows with each iteration.
I'm running the most recent version I can find (matplotlib-1.2.0rc3.win32-py2.7 and numpy-1.7.0.win32-py2.7), and the problem remains. I'm not keeping the return value of imshow, and in fact I'm explicitly deleting it, so I think the note in the IPython discussion doesn't apply. The behavior is identical with and without the explicit assignment-and-del inside the loop.
I see the same behavior with matplotlib-1.2.0.win32-py2.7.
Each iteration seems to hang onto whatever memory was needed for the image. I've chosen a large (1024x1024) random matrix to make the size of each image interestingly large.
I'm running Win7 Pro with 2 GB of physical RAM, 32-bit Python 2.7.3 (hence the memory error), and the above numpy and matplotlib packages. The code below fails with a memory error at iteration 440 or so. The Windows Task Manager reports consumption of 1,860,232K when it fails.
Here is code that demonstrates the leak:
IMAGE_SIZE = 1024

import random

RANDOM_MATRIX = []
for i in range(IMAGE_SIZE):
    RANDOM_MATRIX.append([random.randint(0, 100) for each in range(IMAGE_SIZE)])

def exercise(aMatrix, aCount):
    for i in range(aCount):
        anImage = imshow(aMatrix, origin='lower left', vmin=0, vmax=100)
        del(anImage)

if __name__ == '__main__':
    from pylab import *
    exercise(RANDOM_MATRIX, 4096)
I can presumably render the image with PIL instead of matplotlib, but in the absence of a workaround, I do think this is a show-stopper for matplotlib.
I struggled to make this work because many posts talk about this problem, but no one seems to provide a working example.
First of all, you should never use the from ... import * syntax with a library you didn't write yourself, because you can never be sure it doesn't declare a symbol that conflicts with one of yours.
Then, calling set_data is not sufficient to solve this problem, for three reasons:
1. You didn't mention where this set_data is called from. It is not a normal function but a method of an object... Which object?
2. set_data alone won't be sufficient if you do not have something to "activate" the changes. Sometimes this happens transparently because another plot triggers it, but if it doesn't, you will need to call flush_events() yourself.
3. set_data won't work if you haven't called imshow() with values it can use to set up its color map.
Here is a working solution (link):
I think I found a workaround; I didn't fully realize how heavyweight imshow is.
The answer is to call imshow just once, then call set_data with RANDOM_MATRIX for each subsequent image.
Problem solved!
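For anyone who finds this later, a minimal sketch of that pattern (using NumPy to build the random matrix instead of the nested lists above; the draw/pause calls are just one way to push the update to the screen):

import numpy as np
import matplotlib.pyplot as plt

IMAGE_SIZE = 1024
RANDOM_MATRIX = np.random.randint(0, 101, (IMAGE_SIZE, IMAGE_SIZE))

fig, ax = plt.subplots()
# imshow is the heavyweight call, so do it exactly once
image = ax.imshow(RANDOM_MATRIX, origin='lower', vmin=0, vmax=100)

for _ in range(100):
    new_matrix = np.random.randint(0, 101, (IMAGE_SIZE, IMAGE_SIZE))
    image.set_data(new_matrix)   # reuse the existing image instead of creating a new one
    fig.canvas.draw_idle()
    plt.pause(0.001)             # flush GUI events so the update is actually shown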
