Running python script in colab very slow as compared to same code run on directly colab in notebook - python-3.x

Recently I was trying to test my model which i already trained. Initially I was using Google colab notebook to write code because of it's interactive features, once I was done writing code and I was getting satisfactory results, it took around 2.5 hr to give final output. After that what I wanted was to transfer the notebook code to .py script, I did that with little bit of modification, saved it in gdrive, and then used command !python test.py. now it took me more than 4.5 hr to get the final output, can any one explain why does colab take so much time when trying to run the python script from gdrive while compared to the same code as used in notebook

I would add time calculation to every step I doubt that takes time and see which step in your whole program takes the time.
a1 = time.time()
//your code step
print(time.time() - a1)
This will give you the time for each step and you can see which one is taking a long time.
Operations to check.
1. object creation in loops
2. read/write operation to Gdrive
Once you find the problem-causing piece of code, you may change it.

Well it can be because of the fact that colab is retrieving the data from gdrive and then might be again writing in gdrive which will of ofcourse take time i guess

Related

Synchronization problem while executing Simulink FMU in ROS 2-Gazebo (TF_OLD_DATA warning)

I'm working in a co-simulation project between Simulink and Gazebo. The aim is to move a robot model in Gazebo with the trajectory coordinates computed from Simulink. I'm using MATLAB R2022a, ROS 2 Dashing and Gazebo 9.9.0 in a computer running Ubuntu 18.04.
The problem is that when launching the FMU with the fmi_adapter, I'm obtaining the following. It is tagged as [INFO], but actually messing up all my project.
[fmi_adapter_node-1] [INFO] [fmi_adapter_node]: Simulation time 1652274762.959713 is greater than timer's time 1652274762.901340. Is your step size to large?
Note the timer's time is higher than the simulation time. Even if I try to change the step size with the optional argument of the fmi_adapter_node, the same log appears with small differences in the times. I'm using the next commands:
ros2 launch fmi_adapter fmi_adapter_node.launch.py fmu_path:=FMI/Trajectory/RobotMARA_SimulinkFMU_v2.fmu # default step size: 0.2
ros2 launch fmi_adapter fmi_adapter_node.launch.py fmu_path:=FMI/Trajectory/RobotMARA_SimulinkFMU_v2.fmu _step_size:=0.001
As you would expect, the outputs of the FMU are the xyz coordinates of the robot trajectory in each time step. Since the fmi_adapter_node creates topics for both inputs and outputs, I'm reading the output xyz values by means of 3 subscribers with the next code. Then, those coordinates are being used to program the robot trajectories with the MoveIt-Python API.
When I run the previous Python code, I'm obtaining the following warning once and again and the robot manipulator actually doesn't move.
[ WARN] [1652274804.119514250]: TF_OLD_DATA ignoring data from the past for frame motor6_link at time 870.266 according to authority unknown_publisher
Possible reasons are listed at http://wiki.ros.org/tf/Errors%20explained
The previous warning is explained here, but I'm not able to fix it. I've tried clicking Reset in RViz, but nothing changes. I've also tried the following without success:
ros2 param set /fmi_adapter_node use_sim_time true # it just sets the timer's time to 0
It seems that the clock is taking negative values, so there is a synchronization problem.
Any help is welcome.
The warning message by the FMIAdapterNode is emitted if the timer's period is only slightly greater than the simulation step-size and if the timer is preempted by other processes or threads.
I created an issue at https://github.com/boschresearch/fmi_adapter/issues/9 which explains this in more detail and lists two possible fixes. It would be great if you could contribute to this discussion.
I assume that the TF_OLD_DATA error is not related to the fmi_adapter. Looking at the code snippet at ROS Answers, I wondered whether x,y,z values are re-published at all given that the lines
pose.position.x = listener_x.value
pose.position.y = listener_y.value
pose.position.z = listener_z.value
are not inside a callback and executed even before rospy.spin(), but maybe that's just truncated.

How to set seed to compute_face_descriptor function in dlib?

I am using compute_face_descriptor function in dlib which is the function of dlib.face_recognition_model_v1('dlib_face_recognition_resnet_model_v1.dat').
There is an option to set "num_jitters". I set "num_jitters"=10, but the output embedding I am getting different on subsequent runs. I have tried setting seed using np.random.seed(43), but still, the output changes on subsequent runs
Is there a way to set seed in this function using "num_jitters"=10 so that the output embedding doesn't change on subsequent runs?
"num_jitters". means how many times dlib will re-sample your face image each time randomly moving it a little bit. That's why you are getting different embeddings.

Python 3.8 RAM owerflow and loading issues

First, I want to mention, that this is our first project in a bigger scale and therefore we don't know everything but we learn fast.
We developed a code for image recognition. We tried it with a raspberry pi 4b but quickly faced that this is way to slow overall. Currently we are using a NVIDIA Jetson Nano. The first recognition was ok (around 30 sec.) and the second try was even better (around 6-7 sec.). The first took so long because the model will be loaded for the first time. Via an API the image recognition can be triggered and the meta data from the AI model will be the response. We use fast-API for this.
But there is a problem right now, where if I load my CNN as a global variable in the beginning of my classification file (loaded on import) and use it within a thread I need to use mp.set_start_method('spawn') because otherwise I will get the following error:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess.
To use CUDA with multiprocessing, you must use the 'spawn' start method"
Now that is of course an easy fix. Just add the method above before starting my thread. Indeed this works but another challenge occurs at the same time. After setting the start method to 'spawn' the ERROR disappears but the Jetson starts to allocate way to much memory.
Because of the overhead and preloaded CNN model, the RAM is around 2.5Gig before the thread starts. After the start it doesn’t stop allocating RAM, it consumes all 4Gig of the RAM and also the whole 6Gig Swap. Right after this, the whole API process kill with this error: "cannot allocate memory" which is obvious.
I managed to fix that as well just by loading the CNN Model in the classification function. (Not preloading it on the GPU as in the two cases before). However, here I got problem as well. The process of loading the model to the GPU takes around 15s - 20s and this every time the recognition starts. This is not suitable for us and we are wondering why we cannot pre-load the model without killing the whole thing after two image-recognitions. Our goal is to be under 5 sec with this.
#clasify
import torchvision.transforms as transforms
from skimage import io
import time
from torch.utils.data import Dataset
from .loader import *
from .ResNet import *
#if this part is in the classify() function than no allocation problem occurs
net = ResNet152(num_classes=25)
net = net.to('cuda')
save_file = torch.load("./model.pt", map_location=torch.device('cuda'))
net.load_state_dict(save_file)
def classify(imgp=""):
#do some classification with the net
pass
if __name__ == '__main__':
mp.set_start_method('spawn') #if commented out the first error ocours
manager = mp.Manager()
return_dict = manager.dict()
p = mp.Process(target=classify, args=('./bild.jpg', return_dict))
p.start()
p.join()
print(return_dict.values())
Any help here will be much appreciated. Thank you.

Python too many subprocesses?

I'm trying to start a lot of python procees on a single machine.
Here is a code snippet:
fout = open(path, 'w')
p = subprocess.Popen((python_path,module_name),stdout=fout,bufsize=-1)
After about 100 processes I'm getting the error below:
Running on win 10 64 bit, Python 3.5. Any Idea what that might be? Already tried to split the start (so start from two scripts) as well as sleep command. After a certain number of processes, the error shows up. Any Idea what that might be? Thanks a lot for any hint!
PS:
Some background. Each process opens database connections as well as does some requests using the requests package. Then some calculations are done using numpy, scipy etc.
PPS: Just discover this error message:
dll load failed the paging file is too small for this operation to complete python (when calling scipy)
Issues solved through reinstalling numpy and scipy + installing mkl.
Strange about this error was that it only appeared after a certain number of processes. Would love to hear if anybody knows why this happened!

Debugging the optmization run while training variables of a pre-trained tensorflow model

I am loading a pre-trained model and then extracting only the trainable variables which I want to optimize (basically change or fine-tune) according to my custom loss. The problem is the moment I pass a mini-batch of data to it, it just hangs and there is no progress. I used Tensorboard for visualization but don't know how to debug when there is no log info available. I had put some basic print statements around it but didn't get any helpful information.
Just to give an idea, this is the piece of code sequentially
# Load and build the model
model = skip_thoughts_model.SkipThoughtsModel(model_config, mode="train")
with tf.variable_scope("SkipThoughts"):
model.build()
theta = [v for v in tf.get_collection(tf.GraphKeys.MODEL_VARIABLES, scope='SkipThoughts') if "SkipThoughts" in v.name]
# F Representation using Skip-Thoughts model
opt_F = tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])
# Training
sess.run([opt_F], feed_dict = {idx: idxTensor})
And the model is from this repository:
The problem is with training i.e. the last step. I verified that the theta list is not empty it has 26 elements in it, like ...
SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/beta:0
SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/gamma:0
SkipThoughts/logits/weights:0
SkipThoughts/logits/biases:0
SkipThoughts/decoder_post/gru_cell/gates/layer_norm/w_h/beta:0
...
Also, even after using tf.debug the issue remains. Maybe it really takes lot of time or is stuck awaiting for some other process? So, I also tried breaking down the
tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])
step into
gvs = tf.train.AdamOptimizer(learning_rate).compute_gradients(model.total_loss, var_list=theta)
opt_F = opt.apply_gradients(gvs)
...
g = sess.run(gvs, feed_dict = {idx: idxTensor})
so that I can check if the gradients are computed in the first place, which got stuck at the same point. In addition to that, I also tried computing the gradients with tf.gradients over just one of the variables and that too for one dimension, but the issue still exists.
I am running this piece of code on an IPython notebook on Azure Cluster with 1 GPU Tesla K80. The GPU usage stays the same throughout the execution and there is no out of memory error.
The kernel interrupt doesn't work and the only way to stop it is by restarting the notebook. Moreover, if I compile this code into a Python file then too I need to explicitly kill the process. However, in any such case I don't get the stack trace to know what is the exact place it is stuck! How should one debug such an issue?
Any help and pointers in this regard would be much appreciated.

Resources