I want to accelerate the for loop in my code (posted as a screenshot).
First, I want to mention that this is our first larger-scale project, so we don't know everything yet, but we learn fast.
We developed code for image recognition. We first tried it on a Raspberry Pi 4B but quickly found that it was way too slow overall. We are currently using an NVIDIA Jetson Nano. The first recognition was OK (around 30 s) and the second try was even better (around 6-7 s). The first one took so long because the model is loaded for the first time. The image recognition can be triggered via an API, and the metadata from the AI model is returned as the response. We use FastAPI for this.
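Roughly how the trigger is wired up (only a sketch; the route name, the payload and the import path are illustrative, not our real code):

from fastapi import FastAPI
from classify import classify   # our classification module (path is illustrative)

app = FastAPI()

@app.post("/recognize")
def recognize(image_path: str):
    # run the CNN on the given image and return the model's metadata as JSON
    return {"meta": classify(image_path)}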
But there is a problem right now: if I load my CNN as a global variable at the beginning of my classification file (so it is loaded on import) and then use it from a worker process, I need to call mp.set_start_method('spawn'), because otherwise I get the following error:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess.
To use CUDA with multiprocessing, you must use the 'spawn' start method"
Now that is of course an easy fix: just add the call above before starting my worker process. Indeed this works, but another challenge appears at the same time. After setting the start method to 'spawn', the error disappears, but the Jetson starts to allocate way too much memory.
Because of the overhead and the preloaded CNN model, RAM usage is already around 2.5 GB before the worker starts. After the start it doesn't stop allocating RAM; it consumes all 4 GB of RAM and the whole 6 GB of swap as well. Right after that, the whole API process is killed with the error "cannot allocate memory", which is no surprise at that point.
I managed to fix that as well, simply by loading the CNN model inside the classification function (not preloading it onto the GPU as in the two cases before). However, this has a problem too: loading the model onto the GPU takes around 15-20 s, and that happens every time a recognition starts. This is not acceptable for us, and we are wondering why we cannot preload the model without the whole thing being killed after two image recognitions. Our goal is to be under 5 s.
# classify
import multiprocessing as mp

import torch
import torchvision.transforms as transforms
from skimage import io
import time
from torch.utils.data import Dataset
from .loader import *
from .ResNet import *

# If this part is inside the classify() function, no allocation problem occurs
net = ResNet152(num_classes=25)
net = net.to('cuda')
save_file = torch.load("./model.pt", map_location=torch.device('cuda'))
net.load_state_dict(save_file)

def classify(imgp="", return_dict=None):
    # do some classification with the net
    pass

if __name__ == '__main__':
    mp.set_start_method('spawn')  # if commented out, the first error occurs
    manager = mp.Manager()
    return_dict = manager.dict()
    p = mp.Process(target=classify, args=('./bild.jpg', return_dict))
    p.start()
    p.join()
    print(return_dict.values())
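For completeness, this is the direction we are considering to keep the model resident without spawning a new process per image: one long-lived worker process that owns the model and receives jobs over a queue. This is only a sketch (the queue protocol and the names worker/job_queue/result_queue are made up, and the actual inference is left out), not our current code:

import multiprocessing as mp
import torch
from ResNet import ResNet152   # adjust to your module path

def worker(job_queue, result_queue):
    # The model is loaded exactly once, inside the only process that touches CUDA.
    net = ResNet152(num_classes=25).to('cuda')
    net.load_state_dict(torch.load("./model.pt", map_location='cuda'))
    net.eval()
    while True:
        imgp = job_queue.get()
        if imgp is None:                      # sentinel: shut the worker down
            break
        # ... preprocess imgp, run net(...), post-process ...
        result_queue.put({"image": imgp, "meta": "todo"})

if __name__ == '__main__':
    mp.set_start_method('spawn')
    job_queue, result_queue = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(job_queue, result_queue))
    p.start()

    job_queue.put('./bild.jpg')               # enqueue one recognition request
    print(result_queue.get())                 # blocks until the worker answers

    job_queue.put(None)                       # stop the worker
    p.join()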
Any help here will be much appreciated. Thank you.
I am using Eclipse 2018-09 (4.9.0) and PyDev 9.0.3. I am running a "standard" motion-detection pipeline with Python 3.6 and OpenCV 3 on Windows 10. I attached a snapshot of pstats from a run of 500 frames. The algorithm runs the usual mask, grayscale conversion, Gaussian blur, absdiff, dilation and contour detection. I made sure there was movement in all frames. I save the movement frames (with moments). Next to file writing with imwrite, cv2.waitKey(1) is the largest consumer of time, to my surprise: it is slower than running the OpenCV pipeline itself (there must be good code in OpenCV). I can't display using cv2.imshow without cv2.waitKey(1). Is there work underway to remove this dependency and put the keyboard event in a callback?
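For context, the loop is essentially the textbook pattern below (a simplified sketch, not my exact code; the profile I attached comes from a loop shaped like this):

import cv2

cap = cv2.VideoCapture(0)
ret, prev = cap.read()
prev_gray = cv2.GaussianBlur(cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY), (21, 21), 0)

while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)
    diff = cv2.absdiff(prev_gray, gray)
    thresh = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)[1]
    thresh = cv2.dilate(thresh, None, iterations=2)
    # [-2] keeps this working on both OpenCV 3 and 4 return signatures
    contours = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    if contours:
        cv2.imwrite("motion.jpg", frame)   # file writing, the other big time consumer
    cv2.imshow("frame", frame)
    cv2.waitKey(1)                         # this call dominates the profile
    prev_gray = gray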
I'm building a Raspberry Pi Zero W security camera and am attempting to integrate motion detection using Node.js. Images are taken with the Pi camera module at 8 megapixels (3280x2464 pixels, roughly 5 MB per image).
On a Pi Zero, resources are limited, so loading an entire image from file into Node.js may limit how fast I can capture and then evaluate large photographs. Surprisingly, I already capture about two of these images per second in a background time-lapse process, and I hope to keep capturing the largest image size roughly once per second at least. One resource that could help with this is the embedded thumbnail inside the large image (the thumbnail size is customizable in the raspistill application).
Do you have thoughts on how I could quickly extract the thumbnail from a large image without loading the full image into Node.js? So far I've found a partial answer here. I'm guessing I would manage this through a buffer somehow?
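Here is roughly the byte-level idea I have in mind, sketched in Python for brevity (I would port it to a Node.js Buffer). The 64 KB head size and the marker scan are assumptions about where the EXIF thumbnail sits in the file, not something raspistill guarantees:

# Read only the head of the JPEG and pull out the embedded EXIF thumbnail,
# which is itself a complete JPEG (SOI ... EOI), without decoding the full image.
def extract_thumbnail(path, head_size=64 * 1024):
    with open(path, "rb") as f:
        head = f.read(head_size)          # never touches the megabytes of image data

    # The main image starts with FFD8; the thumbnail is a second FFD8 inside
    # the EXIF segment. Find the second SOI marker and the next EOI after it.
    first_soi = head.find(b"\xff\xd8\xff")
    second_soi = head.find(b"\xff\xd8\xff", first_soi + 2)
    if second_soi == -1:
        return None                        # no embedded thumbnail found in the head
    eoi = head.find(b"\xff\xd9", second_soi)
    if eoi == -1:
        return None
    return head[second_soi:eoi + 2]        # raw JPEG bytes of the thumbnail

thumb = extract_thumbnail("./capture.jpg")
if thumb:
    with open("thumb.jpg", "wb") as out:
        out.write(thumb)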
I have a loop reading voltages from an Arduino (with a specific sampling rate and clock frequency).
When I read the data without plotting (the loop contains only the fread/fscanf and i++), the data comes in without any problems.
Once I add a rolling plot to display the acquired data, the signal is suddenly lost and the program stops. Any clarification on why that happens?
If there is sample code for multithreading that plots and performs data acquisition at the same time, I would be very grateful.
Thank you!
I am not sure whether it works for you, but I had the same problem and I used pause(t) before the end of the loop.
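In case it helps, this is the shape of loop I mean, sketched in Python/matplotlib rather than MATLAB (the port name and baud rate are placeholders, and the parsing is simplified). The key point is the pause call at the end of each iteration, which gives the plot a chance to redraw instead of starving the GUI:

import serial
import matplotlib.pyplot as plt
from collections import deque

PORT, BAUD = "/dev/ttyUSB0", 115200         # placeholders, adjust to your board
ser = serial.Serial(PORT, BAUD, timeout=1)

window = deque(maxlen=500)                  # rolling window of the last 500 samples
plt.ion()
fig, ax = plt.subplots()
line, = ax.plot([])

while True:
    raw = ser.readline()                    # one voltage reading per line
    if not raw:
        continue
    try:
        window.append(float(raw))
    except ValueError:
        continue                            # skip malformed lines
    line.set_data(range(len(window)), list(window))
    ax.relim()
    ax.autoscale_view()
    plt.pause(0.001)                        # the equivalent of pause(t) in the answer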
My application (Qt/OpenGL) needs to ingest, at 25 fps, a number of video streams from IP cameras and then process them, applying:
for each video, a demosaic filter, a sharpening filter, a LUT and distortion correction.
Then I need to render in OpenGL (texture projection, etc.), picking one or more of the frames processed earlier.
Then I need to show the result in some widgets (QGLWidget) and read back the pixels to write them into a movie file.
I am trying to understand the pros and cons of PBOs and FBOs, and I picture the following architecture, which I would like to validate with your help:
I create one thread per video to capture frames into a buffer (an array of images). There is one buffer per video.
I create an upload-filter-render thread which aims to: a) upload the frames to the GPU, b) apply the filters on the GPU, c) apply the composition and render to a texture.
I let the GUI thread render the texture created in the previous step into my widget.
For the upload-frames-to-GPU step, I guess the best way is to use PBOs (maybe two PBOs) for each video, to upload the frames asynchronously (sketched below).
For the apply-filter-on-GPU step, I want to use an FBO, which seems best for render-to-texture. I would first bind the texture uploaded via the PBO and then render the filtered image into another texture. I am not sure whether to use a single FBO and switch the bound input texture and target texture depending on which video is being processed, or to use as many FBOs as there are videos.
Finally, to show the result in a widget, I use the final texture rendered via the FBO. For writing into a movie file, I use a PBO to copy the pixels back asynchronously from GPU to CPU.
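To make the PBO part concrete, here is roughly what I mean by the double-PBO upload, sketched with Python/PyOpenGL for brevity (my real code is C++/Qt; the class and method names are made up, not working code):

import numpy as np
from OpenGL.GL import (
    glGenBuffers, glBindBuffer, glBufferData, glBufferSubData,
    glBindTexture, glTexSubImage2D,
    GL_PIXEL_UNPACK_BUFFER, GL_STREAM_DRAW, GL_TEXTURE_2D,
    GL_RGB, GL_UNSIGNED_BYTE,
)

class PboUploader:
    # Ping-pong PBO pair: while the texture update is sourced from one PBO,
    # the next frame is copied into the other, so the CPU copy and the
    # DMA transfer to the GPU can overlap.
    def __init__(self, width, height, channels=3):
        self.w, self.h = width, height
        self.size = width * height * channels
        self.pbos = glGenBuffers(2)
        for pbo in self.pbos:
            glBindBuffer(GL_PIXEL_UNPACK_BUFFER, int(pbo))
            glBufferData(GL_PIXEL_UNPACK_BUFFER, self.size, None, GL_STREAM_DRAW)
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0)
        self.index = 0

    def upload(self, texture_id, frame):
        # texture_id: a GL_TEXTURE_2D already allocated at (w, h) with glTexImage2D
        # frame: HxWx3 uint8 numpy array handed over by the capture thread
        src_pbo = int(self.pbos[self.index])       # filled on the previous call
        dst_pbo = int(self.pbos[1 - self.index])   # will receive the new frame

        # Kick off the texture update from the previously filled PBO;
        # with a PBO bound, the last argument is an offset (0), not a pointer.
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, src_pbo)
        glBindTexture(GL_TEXTURE_2D, texture_id)
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, self.w, self.h,
                        GL_RGB, GL_UNSIGNED_BYTE, None)

        # Meanwhile copy the new frame into the other PBO for the next call.
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, dst_pbo)
        glBufferSubData(GL_PIXEL_UNPACK_BUFFER, 0, self.size,
                        np.ascontiguousarray(frame, dtype=np.uint8))

        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0)
        self.index = 1 - self.index

(The very first call uploads an unfilled PBO, which is fine for a sketch.) The FBO side would then bind texture_id as input and render the filtered result into a second texture attached to the FBO.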
Does it seem correct?