Python 3.8 RAM owerflow and loading issues - python-3.x

First, I want to mention, that this is our first project in a bigger scale and therefore we don't know everything but we learn fast.
We developed a code for image recognition. We tried it with a raspberry pi 4b but quickly faced that this is way to slow overall. Currently we are using a NVIDIA Jetson Nano. The first recognition was ok (around 30 sec.) and the second try was even better (around 6-7 sec.). The first took so long because the model will be loaded for the first time. Via an API the image recognition can be triggered and the meta data from the AI model will be the response. We use fast-API for this.
But there is a problem right now, where if I load my CNN as a global variable in the beginning of my classification file (loaded on import) and use it within a thread I need to use mp.set_start_method('spawn') because otherwise I will get the following error:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess.
To use CUDA with multiprocessing, you must use the 'spawn' start method"
Now that is of course an easy fix. Just add the method above before starting my thread. Indeed this works but another challenge occurs at the same time. After setting the start method to 'spawn' the ERROR disappears but the Jetson starts to allocate way to much memory.
Because of the overhead and preloaded CNN model, the RAM is around 2.5Gig before the thread starts. After the start it doesn’t stop allocating RAM, it consumes all 4Gig of the RAM and also the whole 6Gig Swap. Right after this, the whole API process kill with this error: "cannot allocate memory" which is obvious.
I managed to fix that as well just by loading the CNN Model in the classification function. (Not preloading it on the GPU as in the two cases before). However, here I got problem as well. The process of loading the model to the GPU takes around 15s - 20s and this every time the recognition starts. This is not suitable for us and we are wondering why we cannot pre-load the model without killing the whole thing after two image-recognitions. Our goal is to be under 5 sec with this.
#clasify
import torchvision.transforms as transforms
from skimage import io
import time
from torch.utils.data import Dataset
from .loader import *
from .ResNet import *
#if this part is in the classify() function than no allocation problem occurs
net = ResNet152(num_classes=25)
net = net.to('cuda')
save_file = torch.load("./model.pt", map_location=torch.device('cuda'))
net.load_state_dict(save_file)
def classify(imgp=""):
#do some classification with the net
pass
if __name__ == '__main__':
mp.set_start_method('spawn') #if commented out the first error ocours
manager = mp.Manager()
return_dict = manager.dict()
p = mp.Process(target=classify, args=('./bild.jpg', return_dict))
p.start()
p.join()
print(return_dict.values())
Any help here will be much appreciated. Thank you.

Related

pytorch loss.backward() keeps running for hours

I am using pytorch to train some x-ray images but I ran into the following issue:
in the line : loss.backward(), the program just keeps running and never end, and there is no error or warning.
loss, outputs = self.forward(images, targets)
loss = loss / self.accumulation_steps
print("loss calculated: " + str(loss))
if phase == "train":
print("running loss backwarding!")
loss.backward()
print("loss is backwarded!")
if (itr + 1 ) % self.accumulation_steps == 0:
self.optimizer.step()
self.optimizer.zero_grad()
The loss calculated before this is something like tensor(0.8598, grad_fn=<DivBackward0>).
Could anyone help me with why this keeps running or any good ways to debug the backward() function?
I am using torch 1.2.0+cu92 with the compatible cuda 10.0.
Thank you so much!!
It's hard to give a definite answer but I have a guess.
Your code looks fine but from the output you've posted (tensor(0.8598, grad_fn=<DivBackward0>)) I conclude that you are operating on your CPU and not on the GPU.
One possible explanation is that the backwards pass is not running forever, but just takes very very long. Training a large network on a CPU is much slower than on a GPU. Check your CPU and memory utilization. It might be that your data and model is too big to fit into your main memory, forcing the operation system to use your hard disk, which would slow down execution by several additional magnitudes. If this is the case I generally recommend:
Use a smaller batch size.
Downscale your images (if possible).
Only open images that are currently needed.
Reduce the size of your model.
Use your GPU (if available) by calling model.cuda(); images = images.cuda() before starting your training.
If that doesn't solve your problem you could start narrowing down the issue by doing some of the following:
Create a minimal working example to reproduce the issue.
Check if the problem persists with other, very simple model architectures.
Check if the problem persists with different input data
Check if the problem persists with a different PyTorch version

Tensorflow supports multiple threads/streams on one GPU for training?

UPDATE:
I found the source code of GPUDevice, it hard-coded max streams to 1, may I know the know reason?
GPUDevice(const SessionOptions& options, const string& name,
Bytes memory_limit, const DeviceLocality& locality,
TfGpuId tf_gpu_id, const string& physical_device_desc,
Allocator* gpu_allocator, Allocator* cpu_allocator)
: BaseGPUDevice(options, name, memory_limit, locality, tf_gpu_id,
physical_device_desc, gpu_allocator, cpu_allocator,
false /* sync every op */, 1 / max_streams /) {
if (options.config.has_gpu_options()) {
force_gpu_compatible_ =
options.config.gpu_options().force_gpu_compatible();
}
======================================
I am wondering whether TensorFlow(1.x version) supports multi-thread or multi-stream on a single GPU. If not, I am curious the underlying reasons, TF did this on some purposes or some libs like CUDA prevents TF from providing or some other reasons?
Like some previous posts[1,2], I tried to run multiple training ops in TF, i.e. sees.run([train_op1, train_op2],feed_dict={...}), I used the TF timeline to profile each iteration. However, TF timeline always showed that two train ops run sequentially (although timeline is not accurate[3], the wall time of each op suggests sequential running). I also looked at some source code of TF, it looks like the each op are computed by in device->ComputeAsync() or device->Compute(), and the GPU is blocked when computing an op. If I am correct, one GPU can only run a single op each time, which may lower GPU utilization.
1.Running multiple tensorflow sessions concurrently
2.Run parallel op with different inputs and same placeholder
3.https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-244251867
I have similar experience with you.
I have two GPU, each GPU run three threads, each thread running a session, each session running time fluct a lot.
if run only one thread on each GPU, session running time is quite stable.
from these appearence, we can conclude that ,thread in tensorflow not cowork well,
the mechanism of tensorflow has problem.

Memory error in pycharm using scipy's welch function

I want to get the Welch's periodogram using scipy.signal in pycharm. My signal is an 5-min audio file with Fs = 48 kHz, so I guess it's a very big signal. The line was:
f, p = signal.welch(audio, Fs, nperseg=512)
I am getting a memory error. I was wondering if that's a pycharm configuration thing, or it's just a too big signal. My RAM is 8 Gb.
Sometimes it works with some audio files, but the idea is to do it with several, so after one or two, the error raises.
I've tested your setup and welch does not seem to be the problem. For further analysis the entire script you are running would be necessary.
import numpy as np
from scipy.signal import welch
fs = 48000
signal_length = 5 * 60 * fs
audio_signal = np.random.rand(signal_length)
f, Pxx = welch(audio_signal, fs=fs, nperseg=512)
On my computer (windows 10, 64 bit) it consumes 600 MB of peak memory during the call to welch which gets recycled directly afterwards, additionally to ~600MB of allocation for the initial array and Python itself. The call to welch itself does not lead to any permanent significant memory increase.
You can do the following:
Upgrade to newest version of scipy, as there have been problems with Welch previously
Check that your PC has enough free memory and close memory-hungry applications (eg. chrome)
Convert your array in a lower datatype e.g. from float64 to float32 or float16
Make sure to free variables that are not needed anymore . Especially if you load several signals and store the result in different arrays, it can accumulate quite quickly. Only keep what you need and delete vars via del variable_name, check that there are no references remaining elsewhere in the program. E.g if you don't need the audio variable, either delete it explicitly after welch(...) or overwrite it with the next audio data.
Run the garbage collector gc.collect(). However, this will probably not solve your problem as garbage is managed automatically in Python anyway.

Debugging the optmization run while training variables of a pre-trained tensorflow model

I am loading a pre-trained model and then extracting only the trainable variables which I want to optimize (basically change or fine-tune) according to my custom loss. The problem is the moment I pass a mini-batch of data to it, it just hangs and there is no progress. I used Tensorboard for visualization but don't know how to debug when there is no log info available. I had put some basic print statements around it but didn't get any helpful information.
Just to give an idea, this is the piece of code sequentially
# Load and build the model
model = skip_thoughts_model.SkipThoughtsModel(model_config, mode="train")
with tf.variable_scope("SkipThoughts"):
model.build()
theta = [v for v in tf.get_collection(tf.GraphKeys.MODEL_VARIABLES, scope='SkipThoughts') if "SkipThoughts" in v.name]
# F Representation using Skip-Thoughts model
opt_F = tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])
# Training
sess.run([opt_F], feed_dict = {idx: idxTensor})
And the model is from this repository:
The problem is with training i.e. the last step. I verified that the theta list is not empty it has 26 elements in it, like ...
SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/beta:0
SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/gamma:0
SkipThoughts/logits/weights:0
SkipThoughts/logits/biases:0
SkipThoughts/decoder_post/gru_cell/gates/layer_norm/w_h/beta:0
...
Also, even after using tf.debug the issue remains. Maybe it really takes lot of time or is stuck awaiting for some other process? So, I also tried breaking down the
tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])
step into
gvs = tf.train.AdamOptimizer(learning_rate).compute_gradients(model.total_loss, var_list=theta)
opt_F = opt.apply_gradients(gvs)
...
g = sess.run(gvs, feed_dict = {idx: idxTensor})
so that I can check if the gradients are computed in the first place, which got stuck at the same point. In addition to that, I also tried computing the gradients with tf.gradients over just one of the variables and that too for one dimension, but the issue still exists.
I am running this piece of code on an IPython notebook on Azure Cluster with 1 GPU Tesla K80. The GPU usage stays the same throughout the execution and there is no out of memory error.
The kernel interrupt doesn't work and the only way to stop it is by restarting the notebook. Moreover, if I compile this code into a Python file then too I need to explicitly kill the process. However, in any such case I don't get the stack trace to know what is the exact place it is stuck! How should one debug such an issue?
Any help and pointers in this regard would be much appreciated.

Sklearn OneClassSVM can not handle large data sets(0xC0000005)

I am using OneClassSVM for outlier detection.
clf = svm.OneClassSVM(kernel='rbf', nu=k, tol=0.001)
clf.fit(train_x)
However it encountered the following error.
Process finished with exit code -1073741819 (0xC0000005)
The data size is around 20MB(train_x), which is not very big too me.
Memory of my computer is 8GB.
However if I reduce the file size, it worked. And this behavior is inconsistent.
Sometimes it worked and did not work otherwise.
Has any one had this problem before?
thanks!

Resources