Raspberry Pi 4 very slow predictions with TensorFlow Lite object detection API - python-3.x

I'm running TensorFlow Lite object detection on a Raspberry Pi 4 Model B with 8GB of RAM, and prediction is very slow at 1.5 to 2 frames per second. Is there a way to get better performance and improve prediction to at least 5 to 10 fps?

I would recommend using this tool to try different running options: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark
And maybe use --enable_op_profiling to see which ops make it slow. A quick fix might be to enable multi-threading or use_gpu.
If you build TFLite using CMake, please set TFLITE_ENABLE_RUY=ON.
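For example, after building the benchmark tool from the repository above, an invocation along these lines profiles the model and reports per-op timings (the model path and thread count here are placeholders, not values from the question):

```shell
# Run the TFLite benchmark tool with 4 threads and op profiling enabled.
# detect.tflite is a placeholder path for your object detection model.
./benchmark_model \
  --graph=detect.tflite \
  --num_threads=4 \
  --enable_op_profiling=true
```

The op profiling output ranks operators by time spent, which tells you whether the model is dominated by a few heavy ops (a candidate for delegation or quantization) or uniformly slow.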

Related

pytorch loss.backward() keeps running for hours

I am using PyTorch to train on some X-ray images, but I ran into the following issue:
at the line loss.backward(), the program just keeps running and never ends, with no error or warning.
loss, outputs = self.forward(images, targets)
loss = loss / self.accumulation_steps
print("loss calculated: " + str(loss))
if phase == "train":
    print("running loss backwarding!")
    loss.backward()
    print("loss is backwarded!")
    if (itr + 1) % self.accumulation_steps == 0:
        self.optimizer.step()
        self.optimizer.zero_grad()
The loss calculated before this is something like tensor(0.8598, grad_fn=<DivBackward0>).
Could anyone help me with why this keeps running or any good ways to debug the backward() function?
I am using torch 1.2.0+cu92 with the compatible cuda 10.0.
Thank you so much!!
It's hard to give a definite answer but I have a guess.
Your code looks fine, but from the output you've posted (tensor(0.8598, grad_fn=<DivBackward0>)) I conclude that you are operating on your CPU and not on the GPU.
One possible explanation is that the backward pass is not running forever, but just takes very, very long. Training a large network on a CPU is much slower than on a GPU. Check your CPU and memory utilization. It might be that your data and model are too big to fit into your main memory, forcing the operating system to use your hard disk, which would slow down execution by several additional orders of magnitude. If this is the case I generally recommend:
Use a smaller batch size.
Downscale your images (if possible).
Only open images that are currently needed.
Reduce the size of your model.
Use your GPU (if available) by calling model.cuda(); images = images.cuda() before starting your training.
If that doesn't solve your problem you could start narrowing down the issue by doing some of the following:
Create a minimal working example to reproduce the issue.
Check if the problem persists with other, very simple model architectures.
Check if the problem persists with different input data.
Check if the problem persists with a different PyTorch version.
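The GPU suggestion above can be sketched as a minimal training step. The model, data, and shapes here are placeholders, not the asker's X-ray setup:

```python
import torch
import torch.nn as nn

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(5, 1).to(device)  # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# Dummy batch in place of the X-ray images and targets.
images = torch.randn(8, 5, device=device)
targets = torch.randn(8, 1, device=device)

optimizer.zero_grad()
loss = criterion(model(images), targets)
loss.backward()  # returns almost instantly for a model this small
optimizer.step()
```

If backward() on a toy model like this also hangs, the problem is the environment rather than model size; if it returns immediately, the original slowness points to model/data scale or memory pressure.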

Can we run TensorFlow Lite on Linux? Or is it for Android and iOS only

Hi, is there any possibility to run TensorFlow Lite on a Linux platform? If yes, then how can we write code in Java/C++/Python to load and run models on Linux? I am familiar with Bazel and have successfully made Android and iOS applications using TensorFlow Lite.
I think the other answers are quite wrong.
Look, I'll tell you my experience... I've been working with Django for many years, and I've been using normal TensorFlow, but there was a problem with having 4, 5, or more models in the same project.
I don't know if you know Gunicorn + Nginx. Gunicorn generates workers, and every worker loads its own copy of each model: if you have 4 machine learning models and 3 workers, you will have 12 models preloaded in RAM. This is not efficient at all, because if the RAM overflows your project will fall over, or the service responses become slower.
So this is where TensorFlow Lite comes in. Switching from a TensorFlow model to TensorFlow Lite makes things much more efficient; times are reduced drastically.
Also, Django and Gunicorn can be configured so that the model is preloaded and compiled at startup. Then every time the API is called, it only generates the prediction, which helps you make each API call a fraction of a second long.
Currently I have a project in production with 14 models and 9 workers, you can understand the magnitude of that in terms of RAM.
And besides doing thousands of extra calculations, outside of machine learning, the API call does not take more than 2 seconds.
Now, if I used normal tensorflow, it would take at least 4 or 5 seconds.
In summary: if you can, use TensorFlow Lite. I use it daily on Windows, macOS, and Linux, and it is not necessary to use Docker at all. Just a Python file and that's it. If you have any doubt you can ask me without any problem.
Here is an example project:
Django + Tensorflow Lite
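The "preload once per worker" idea described above can be sketched like this. load_interpreter and the model path are hypothetical stand-ins for building a real tf.lite.Interpreter at Django startup:

```python
# Sketch: cache one interpreter per worker process so each API call
# only runs inference instead of reloading the model.
from functools import lru_cache

@lru_cache(maxsize=None)
def load_interpreter(model_path):
    # In a real project this would create tf.lite.Interpreter(model_path)
    # and call allocate_tensors(); a dict stands in for it here.
    return {"model_path": model_path, "ready": True}

# Both calls return the same cached object: the model loads only once.
a = load_interpreter("model.tflite")
b = load_interpreter("model.tflite")
```

Each Gunicorn worker pays the load cost once, on its first request, and every later request reuses the cached interpreter.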
It's possible to run it (but it will work slower than the original TF).
Example
import numpy as np
import tensorflow as tf

# Load the TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path=graph_file)
interpreter.allocate_tensors()
# Get input and output tensor details.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Get quantization info to know the input type.
quantization = None
using_type = input_details[0]['dtype']
if using_type is np.uint8:
    quantization = input_details[0]['quantization']
# Get the input shape.
input_shape = input_details[0]['shape']
# Build a dummy input tensor.
input_data = np.zeros(dtype=using_type, shape=input_shape)
# Set the input tensor, run, and get the output tensor.
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
output_data = interpreter.get_tensor(output_details[0]['index'])
I agree with Nouvellie. It is possible and worth the time implementing. I developed a model on my Ubuntu 18.04 32-processor server and exported the model to tflite. The model ran in 178 seconds on my Ubuntu server. On my Raspberry Pi 4 with 4GB memory, the tflite implementation ran in 85 seconds, less than half the time of my server. When I installed tflite on my server, the run time went down to 22 seconds: an 8-fold improvement in performance, and now almost 4 times faster than the RPi 4.
To install for python, I did not have to build the package but was able to use one of the prebuilt interpreters here:
https://www.tensorflow.org/lite/guide/python
I have Ubuntu 18.04 with python 3.7.7. So I ran pip install with the Linux python 3.7 package:
pip3 install
https://dl.google.com/coral/python/tflite_runtime-2.1.0.post1-cp37-cp37m-linux_x86_64.whl
Then import the package with:
from tflite_runtime.interpreter import Interpreter
Previous posts show how to use tflite.
From Tensorflow lite
TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded devices.
TensorFlow Lite is a slimmed-down version of TensorFlow for embedded devices. For PC just use the original TensorFlow.
From github tensorflow:
TensorFlow is an open source software library
TensorFlow provides stable Python and C APIs, as well as APIs without backwards compatibility guarantees for C++, Go, Java, JavaScript and Swift.
We support CPU and GPU packages on Linux, Mac, and Windows.
>>> import tensorflow as tf
>>> tf.enable_eager_execution()
>>> tf.add(1, 2)
3
>>> hello = tf.constant('Hello, TensorFlow!')
>>> hello.numpy()
'Hello, TensorFlow!'
Yes, you can compile Tensorflow Lite to run on Linux platforms even with a Docker container. See the demo: https://sconedocs.github.io/tensorflowlite/

The meaning of "n_jobs == 1" in GridSearchCV when using multiple GPUs

I have been training an NN model using the Keras framework with 4 NVIDIA GPUs. (Data row count: ~160,000; column count: 5). Now I want to optimize its parameters using GridSearchCV.
However, I encountered several different errors whenever I tried to change n_jobs to values other than one, such as:
CUDA OUT OF MEMORY
Can not get device properties error code : 3
Then I read this web page,
"# if you're not using a GPU, you can set n_jobs to something other than 1"
http://queirozf.com/entries/scikit-learn-pipeline-examples
So is it not possible to use multiple GPUs with GridSearchCV?
[Environment]
Ubuntu 16.04
Python 3.6.0
Keras / Scikit-Learn
Thanks!
According to the FAQ in scikit-learn, GPU is NOT supported. Link
You can use n_jobs to use your CPU cores. If you want to run at maximum speed you might want to use almost all of your cores:
import multiprocessing
n_jobs = multiprocessing.cpu_count()-1
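Putting that together, a CPU-parallel grid search looks roughly like this. The estimator, data, and parameter grid below are placeholders for illustration, not the asker's Keras model:

```python
import multiprocessing

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset standing in for the real 160k-row data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Leave one core free for the OS, as suggested above.
n_jobs = max(1, multiprocessing.cpu_count() - 1)

search = GridSearchCV(
    LogisticRegression(max_iter=500),
    param_grid={"C": [0.1, 1.0, 10.0]},
    n_jobs=n_jobs,  # parallelises across CPU cores, not GPUs
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

Note that n_jobs > 1 forks multiple processes, each of which would try to grab its own chunk of GPU memory if the estimator used a GPU, which is exactly why the CUDA out-of-memory errors appear with a Keras model.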

YOLO - tensorflow works on cpu but not on gpu

I've used YOLO detection with a trained model using my GPU (an Nvidia 1060 3GB), and everything worked fine.
Now I am trying to generate my own model with the parameter --gpu 1.0. TensorFlow can see my GPU, as I can read from these messages printed at startup:
"name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705"
"totalMemory: 3.00GiB freeMemory: 2.43GiB"
Anyway, later on, when the program loads data and tries to start learning, I get the following error:
"failed to allocate 832.51M (872952320 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY"
I've checked if it tries to use my other GPU (Intel 630), but it doesn't.
When I run the training process without the --gpu option, it works fine, but slowly.
(I've also tried --gpu 0.8, 0.4, etc.)
Any idea how to fix it?
Problem solved. Changing the batch size and image size in the config file didn't seem to help, as they didn't load correctly. I had to go to the defaults.py file and lower them there, to make it possible for my GPU to calculate the steps.
Looks like your custom model uses too much memory and the graphics card cannot support it. You only need to use the --batch option to control the memory usage.

How to deal with a large linnet object

I am trying to use a whole city's network for a particular analysis, which I know is very large. I have also set it up as a sparse network.
library(maptools)
library(rgdal)
StreetsUTM=readShapeSpatial("cityIN_UTM")
#plot(StreetsUTM)
library(spatstat)
SS_StreetsUTM =as.psp(StreetsUTM)
SS_linnetUTM = as.linnet(SS_StreetsUTM, sparse=TRUE)
> SS_linnetUTM
Linear network with 321631 vertices and 341610 lines
Enclosing window: rectangle = [422130.9, 456359.7] x [4610458,
4652536] units
> SS_linnetUTM$sparse
[1] TRUE
I have the following problems:
It took 15-20 minutes to build the psp object.
It took almost 5 hours to build the linnet object.
Every time I want to analyse it for a point pattern or envelope, R crashes.
I understand I should try to reduce the network size, but:
I was wondering if there is a smart way to overcome this problem. Would rescaling help?
How can I apply more processing power to it?
I am also curious to know if spatstat can be used with the parallel package.
In the end, what are the limitations on network size for spatstat?
R crashes
R crashes when I use the instructions from Spatstat book:
KN <- linearK(spiders, correction="none")   # on my network (linnet), of course
envelope(spiders, linearK, correction="none", nsim=39)   # on my network
I do not think RAM is the problem; I have 16GB RAM and a 2.5GHz dual-core i5 processor in a machine with an SSD.
Could someone guide me please.
Please be more specific about the commands you used.
Did you build the linnet object from a psp object using as.linnet.psp (in which case the connectivity of the network must be guessed, and this can take a long time), or did you have information about the connectivity of the network that you passed to the linnet() command?
Exactly which commands to "analyse it for a point pattern or envelope" cause a crash, and what kind of crash?
The code for linear networks in spatstat is research code which is still under development. Faster algorithms for the K-function will be released soon.
I could only resolve this by simplifying my network in QGIS with the Douglas-Peucker algorithm in the Simplify Geometries tool. So it is a slight compromise on the geometry of the linear network in the shapefile.
