AzureML- access the notebook running after closing the browser - azure-machine-learning-service

I am currently running a notebook on Azure ML for a neural network that is going to take a few hours to train. I accidentally closed the window. When I open the notebook again, it isn't showing any progress, but the compute instance still says that it's running and the kernel is busy; the CPU and RAM usage are also the same as when the model was training. I would just like to see the progress of the model training, but have no idea how to access it.
Any ideas? Note: I'm new to AzureML, so it might be really simple and I just can't figure it out. Any help is greatly appreciated!

This is an existing issue in AzureML; I have personally reported it using the feedback tool. I suggest doing the same: the more people report it, the more likely it is to be prioritized.
In the meantime, I suggest redirecting sys.stdout to a log file, which you can look at later on:
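import sys
old_stdout = sys.stdout              # keep a reference so stdout can be restored
log_file = open("message.log", "w")
sys.stdout = log_file                # print() now writes to message.log
print("this will be written to message.log")
sys.stdout = old_stdout              # restore the original stdout
log_file.close()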
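As an alternative to swapping sys.stdout, the same idea can be sketched with the standard logging module, which flushes each record to the file as it is written. This is only a minimal sketch: the file name, the logger name, and the stand-in training loop with its made-up loss values are assumptions for illustration, not part of the original answer.

```python
import logging


def make_progress_logger(log_path="training.log"):
    """Return a logger that appends timestamped lines to log_path."""
    logger = logging.getLogger("training_progress")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers when the cell is re-run
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger


def train(logger, epochs=3):
    """Stand-in for a real training loop; the loss values are made up."""
    for epoch in range(epochs):
        loss = 1.0 / (epoch + 1)  # hypothetical metric
        logger.info("epoch=%d loss=%.4f", epoch, loss)
```

After reopening the browser, the log file can be read from a new cell or a terminal while the original kernel keeps training.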

Related

python-can Keep getting the same message(id)

I'm more of a HW engineer who's currently trying to use Python at work. What I want to accomplish with Python is to read the CAN-FD output from the DUT and use it for monitoring purposes in the measurement setup. However, I don't think I'm getting the correct result, because it shows the same message (ID) even though there are many more on the bus. Based on my understanding of other examples, this should show the stream of all messages, since no filters are set. Can anyone help me solve this issue, or has anyone had a similar experience?
import can

def _get_message(msg):
    return msg

bus = can.interface.Bus(bustype='vector', app_name='app', channel=1, bitrate=500000)
buffer = can.BufferedReader()
can.Notifier(bus, [_get_message, buffer])
while True:
    msgs = bus.recv(None)
    print(msgs)
You don't say what you expected to see on your bus, but I assume there are a lot of different messages on it.
Simplify your problem to start with - set up a bus which has only a single message on it at a low rate. You might have to write some code for that, or you might get some tools which can send periodic messages with counters in for example. Make sure you can capture that correctly first. Then add a second message at a faster rate.
You will probably learn a selection of useful things from these exercises, which means that when you go back to your full system you will have more success, or at least more ideas on what to debug :)
First of all, thanks for the answers and comments. The root cause was an incorrect HW configuration for the interface. I thought that if the HW configuration were wrong in the first place, there would be no output at all, but that turned out not to be the case. After proper configuration, I could see the output stream I expected.
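When checking whether a bus really carries only one ID, it can help to tally arbitration IDs over a short capture window instead of reading a stream of printed messages. The following is a minimal sketch under assumptions: tally_ids is a hypothetical helper, and it only relies on a receive callable that returns None on timeout and on messages exposing an arbitration_id attribute, which is how bus.recv and can.Message behave in python-can.

```python
import time
from collections import Counter


def tally_ids(recv, duration_s=5.0, timeout_s=0.5):
    """Count how often each arbitration ID is seen within duration_s seconds.

    recv is any callable that takes a timeout and returns a message-like
    object with an arbitration_id attribute, or None when nothing arrived.
    """
    counts = Counter()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        msg = recv(timeout_s)
        if msg is not None:
            counts[msg.arbitration_id] += 1
    return counts
```

With a real bus this would be called as tally_ids(bus.recv). A result with a single key over several seconds points at the setup (wiring, channel, or interface configuration) rather than at the reading loop.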

Python 3.8 RAM overflow and loading issues

First, I want to mention that this is our first project on a bigger scale, so we don't know everything yet, but we learn fast.
We developed code for image recognition. We tried it on a Raspberry Pi 4B but quickly found that this is way too slow overall. Currently we are using an NVIDIA Jetson Nano. The first recognition was OK (around 30 sec.) and the second try was even better (around 6-7 sec.). The first took so long because the model is loaded for the first time. The image recognition can be triggered via an API, and the metadata from the AI model is returned as the response. We use FastAPI for this.
But there is a problem right now, where if I load my CNN as a global variable in the beginning of my classification file (loaded on import) and use it within a thread I need to use mp.set_start_method('spawn') because otherwise I will get the following error:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess.
To use CUDA with multiprocessing, you must use the 'spawn' start method"
Now that is of course an easy fix: just call the method above before starting my thread. This does work, but another challenge appears at the same time. After setting the start method to 'spawn' the error disappears, but the Jetson starts to allocate way too much memory.
Because of the overhead and the preloaded CNN model, RAM usage is around 2.5 GB before the thread starts. After the start it doesn't stop allocating RAM: it consumes all 4 GB of RAM and also the whole 6 GB of swap. Right after this, the whole API process is killed with the error "cannot allocate memory", which is to be expected.
I managed to fix that as well, just by loading the CNN model inside the classification function (not preloading it on the GPU as in the two cases before). However, this causes a problem too. Loading the model to the GPU takes around 15-20 s, and this happens every time the recognition starts. This is not suitable for us, and we are wondering why we cannot preload the model without killing the whole thing after two image recognitions. Our goal is to be under 5 sec.
# classify
import torch
import torch.multiprocessing as mp  # assumed: the original does not show where mp comes from
import torchvision.transforms as transforms
from skimage import io
import time
from torch.utils.data import Dataset
from .loader import *
from .ResNet import *

# if this part is inside the classify() function, then no allocation problem occurs
net = ResNet152(num_classes=25)
net = net.to('cuda')
save_file = torch.load("./model.pt", map_location=torch.device('cuda'))
net.load_state_dict(save_file)

def classify(imgp="", return_dict=None):
    # do some classification with the net
    pass

if __name__ == '__main__':
    mp.set_start_method('spawn')  # if commented out, the first error occurs
    manager = mp.Manager()
    return_dict = manager.dict()
    p = mp.Process(target=classify, args=('./bild.jpg', return_dict))
    p.start()
    p.join()
    print(return_dict.values())
Any help here will be much appreciated. Thank you.
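One possible explanation, offered as an assumption rather than a confirmed diagnosis: with the 'spawn' start method the child process imports the module again, so a module-level model load runs a second time in the child while the parent still holds its own copy. A minimal sketch of the alternative is to load lazily behind a cached accessor and to call classify from a single long-lived process instead of spawning one per request. The names get_model and LOAD_CALLS and the dictionary standing in for the network are hypothetical placeholders.

```python
import functools

LOAD_CALLS = []  # records each load so the caching is observable


@functools.lru_cache(maxsize=1)
def get_model():
    """Load the model on first use only; later calls reuse the same object."""
    LOAD_CALLS.append(1)
    # Placeholder for the real work, e.g. building the ResNet, moving it
    # to the GPU, and loading the state dict.
    return {"name": "stand-in model"}


def classify(imgp=""):
    """Classify one image using the cached model (stand-in logic only)."""
    model = get_model()  # cheap after the first call within one process
    return model["name"], imgp
```

The cache only helps within one process, so the design choice here is to keep the process that owns the model alive between requests.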

Running a Python script in Colab is very slow compared to the same code run directly in a Colab notebook

Recently I was testing a model that I had already trained. Initially I wrote the code in a Google Colab notebook because of its interactive features. Once the code was done and giving satisfactory results, it took around 2.5 hours to produce the final output. I then moved the notebook code to a .py script with a few modifications, saved it in Google Drive, and ran it with the command !python test.py. This time it took more than 4.5 hours to produce the final output. Can anyone explain why Colab takes so much longer to run the Python script from Google Drive than to run the same code in a notebook?
I would add timing around every step that I suspect is slow and see which step in the whole program takes the time.
import time
a1 = time.time()
# your code step
print(time.time() - a1)
This will give you the time for each step and you can see which one is taking a long time.
Operations to check.
1. object creation in loops
2. read/write operation to Gdrive
Once you find the problem-causing piece of code, you may change it.
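The per-step timing described above can be wrapped in a small context manager so that each step is labelled and the results are collected in one place. This is a minimal sketch; the names timed and timings are hypothetical helpers, not part of Colab or any library.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed seconds


@contextmanager
def timed(label):
    """Record how long the enclosed block takes under the given label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start
```

Each suspect step then goes inside a block such as `with timed("read from Drive"):`, and sorting timings by value at the end shows where the time goes.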
It may be because Colab is retrieving the data from Google Drive and possibly writing back to Google Drive as well, which will of course take time.

Python too many subprocesses?

I'm trying to start a lot of Python processes on a single machine.
Here is a code snippet:
fout = open(path, 'w')
p = subprocess.Popen((python_path,module_name),stdout=fout,bufsize=-1)
After about 100 processes I'm getting the error below:
I'm running on Windows 10 64-bit with Python 3.5. I already tried splitting the start across two scripts, as well as adding a sleep command, but after a certain number of processes the error still shows up. Any idea what that might be? Thanks a lot for any hint!
PS:
Some background. Each process opens database connections as well as does some requests using the requests package. Then some calculations are done using numpy, scipy etc.
PPS: I just discovered this error message:
dll load failed the paging file is too small for this operation to complete python (when calling scipy)
The issue was solved by reinstalling numpy and scipy and installing mkl.
The strange thing about this error was that it only appeared after a certain number of processes. I would love to hear if anybody knows why this happened!
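The source only reports that reinstalling the packages fixed the problem. If the limit is the total memory committed by children running at the same time (an assumption, not something the source confirms), capping the number of simultaneous children is one way to test that. The sketch below is hedged accordingly: run_job and run_all are hypothetical helpers, and each log file is closed as soon as its child exits.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_job(args, log_path):
    """Run one child process, writing its stdout to log_path; return its exit code."""
    with open(log_path, "w") as fout:  # closed as soon as the child exits
        return subprocess.run(args, stdout=fout).returncode


def run_all(jobs, max_parallel=8):
    """Run (args, log_path) pairs with at most max_parallel children alive at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(lambda job: run_job(*job), jobs))
```

A job here would look like ([sys.executable, "-m", module_name], path). If the error disappears with a lower max_parallel, the number of concurrent processes is the trigger.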

Debugging the optimization run while training variables of a pre-trained tensorflow model

I am loading a pre-trained model and then extracting only the trainable variables which I want to optimize (basically change or fine-tune) according to my custom loss. The problem is the moment I pass a mini-batch of data to it, it just hangs and there is no progress. I used Tensorboard for visualization but don't know how to debug when there is no log info available. I had put some basic print statements around it but didn't get any helpful information.
Just to give an idea, this is the piece of code sequentially
# Load and build the model
model = skip_thoughts_model.SkipThoughtsModel(model_config, mode="train")
with tf.variable_scope("SkipThoughts"):
    model.build()
theta = [v for v in tf.get_collection(tf.GraphKeys.MODEL_VARIABLES, scope='SkipThoughts') if "SkipThoughts" in v.name]
# F Representation using Skip-Thoughts model
opt_F = tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])
# Training
sess.run([opt_F], feed_dict = {idx: idxTensor})
And the model is from this repository:
The problem is with training i.e. the last step. I verified that the theta list is not empty it has 26 elements in it, like ...
SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/beta:0
SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/gamma:0
SkipThoughts/logits/weights:0
SkipThoughts/logits/biases:0
SkipThoughts/decoder_post/gru_cell/gates/layer_norm/w_h/beta:0
...
Also, even after using tf.debug the issue remains. Maybe it really takes a lot of time, or it is stuck waiting for some other process? So, I also tried breaking down the
tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])
step into
opt = tf.train.AdamOptimizer(learning_rate)
gvs = opt.compute_gradients(model.total_loss, var_list=theta)
opt_F = opt.apply_gradients(gvs)
...
g = sess.run(gvs, feed_dict = {idx: idxTensor})
so that I can check if the gradients are computed in the first place, which got stuck at the same point. In addition to that, I also tried computing the gradients with tf.gradients over just one of the variables and that too for one dimension, but the issue still exists.
I am running this piece of code on an IPython notebook on Azure Cluster with 1 GPU Tesla K80. The GPU usage stays the same throughout the execution and there is no out of memory error.
The kernel interrupt doesn't work, and the only way to stop it is by restarting the notebook. Moreover, if I put this code into a Python file and run it, I still need to explicitly kill the process. In either case I don't get a stack trace that shows exactly where it is stuck. How should one debug such an issue?
Any help and pointers in this regard would be much appreciated.
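One way to obtain a stack trace from a process that appears hung is the standard library's faulthandler module, which can dump the Python stack of every thread after a timeout. The sketch below is offered under assumptions: watch_for_hang and stop_watching are hypothetical helper names, and the dump shows only Python-level frames, so if the process is stuck inside native TensorFlow code it will point at the calling line (for example the sess.run call) rather than at the native cause.

```python
import faulthandler
import time


def watch_for_hang(log_path, timeout_s=60):
    """Dump every thread's Python stack to log_path if still running after timeout_s."""
    log_file = open(log_path, "w")  # must stay open while the watchdog is armed
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file


def stop_watching(log_file):
    """Disarm the watchdog and close its log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()
```

Calling watch_for_hang just before the training step and stop_watching after it means a dump is written only when the step takes longer than the timeout. Writing to a real file avoids depending on how the notebook handles stderr.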
