Hybrid quantization of ONNX model possible? - onnx

From the onnxruntime documentation I could not find any information regarding support for hybrid quantization.
documentation on onnxruntime quantization
example code for quantizing ONNX model
I would like to quantize an ONNX model to uint8. The target hardware the model is meant to run on supports uint8 and int16 operations (for a few layers, float32 would probably also work).
Is it possible to apply hybrid quantization, where sensitive layers get a higher resolution?
Or is it just 'all or nothing'?
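For what it's worth, onnxruntime's quantization API does expose per-node control: quantize_static (and quantize_dynamic) accept nodes_to_exclude / nodes_to_quantize arguments, so excluded nodes stay in float32 while the rest are quantized to uint8, which gives you a hybrid model. A minimal sketch; the input name, input shape, random calibration data, and node names below are placeholders, not taken from a real model:
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class PlaceholderCalibrationReader(CalibrationDataReader):
    # Feeds a few samples for calibration; replace with real, representative data
    def __init__(self, input_name="input", count=10):
        self.samples = iter(
            {input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
            for _ in range(count)
        )

    def get_next(self):
        # Return the next input dict, or None when calibration data is exhausted
        return next(self.samples, None)

quantize_static(
    "model.onnx",
    "model_uint8.onnx",
    calibration_data_reader=PlaceholderCalibrationReader(),
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QUInt8,
    nodes_to_exclude=["Conv_0", "MatMul_12"],  # hypothetical sensitive nodes, kept in float32
)
Note this gives you uint8 plus float32 fallback layers, not int16; whether the float32 layers actually run on your target hardware is a question for its toolchain.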

Related

Do pytorch lite models have a performance drop?

I would like to deploy my pytorch model on mobile devices.
Based on the deployment workflow here: https://pytorch.org/mobile/home/#deployment-workflow, I need to quantize my model and convert it to a .ptl file.
It seems I can run inference with the quantized model (https://discuss.pytorch.org/t/how-to-load-quantized-model-for-inference/140283), so I can evaluate whether its performance drops.
But I don't know how to make sure there is no performance drop going from the quantized/optimized model to the .ptl file. Or is it guaranteed that the conversion from the quantized model to .ptl causes no performance drop?
Thanks!!
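One way to check rather than assume: the lite-interpreter file can also be executed on the host, so you can convert and then compare outputs (or full eval-set metrics) between the quantized model and the reloaded .ptl. A sketch, assuming quantized_model is your already-quantized module; the input shape is a placeholder:
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile
from torch.jit.mobile import _load_for_lite_interpreter

# Script the quantized model, apply the mobile optimization passes, save as .ptl
scripted = torch.jit.script(quantized_model)
optimized = optimize_for_mobile(scripted)
optimized._save_for_lite_interpreter("model.ptl")

# Load the .ptl back on the host and compare outputs on the same input
lite = _load_for_lite_interpreter("model.ptl")
x = torch.randn(1, 3, 224, 224)  # placeholder input shape
print(torch.allclose(scripted(x), lite(x), atol=1e-5))
If the raw outputs match (and your validation metrics agree), the conversion itself introduced no accuracy drop.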

Is there any way to speed up the predicting process for tensorflow lattice?

I built my own model with Keras Premade Models in tensorflow lattice using python3.7 and saved the trained model. However, when I use the trained model for prediction, each data point takes on the order of milliseconds, which seems very slow. Is there any way to speed up the prediction process for tfl?
There are multiple ways to improve speed, but they may involve a tradeoff with prediction accuracy. I think the three most promising options are:
Reduce the number of features
Reduce the number of lattices per feature
Use an ensemble of lattice models where every lattice model only gets a subset of the features, and then average the predictions of the different models (as described here)
As the lattice model is a standard Keras model, I recommend trying OpenVINO. It optimizes your model by converting it to Intermediate Representation (IR), performing graph pruning, and fusing some operations into others while preserving accuracy. It then uses vectorization at runtime. OpenVINO is optimized for Intel hardware, but it should work with any CPU.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is via pip. Alternatively, you can use this tool to find the best installation option for your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO cannot convert an HDF5 model directly, so you have to save it in the SavedModel format first.
import tensorflow as tf
from custom_layer import CustomLayer

# Load the trained HDF5 model, registering any custom layers it uses
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})

# Re-save in the SavedModel format expected by the Model Optimizer
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert the SavedModel
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (including graphics integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. If you care about latency, I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement. If you care about throughput, change the value to THROUGHPUT or CUMULATIVE_THROUGHPUT.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT": "LATENCY"})
# Get the output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

Is there a pretrained model that can detect and classify if a human is in a photo?

I am trying to find a pre-trained model that will classify images based on whether a human is present in the photo or not.
You can use the models trained on the COCO dataset for this.
For example, for Pytorch you can have a look at the official documentation concerning the provided models here.
You will find a greater variety of models with a simple search, both for Pytorch and other frameworks.
You can check out the COCO homepage if you need more information concerning the dataset and the tasks it supports.
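For example, a sketch with torchvision's COCO-pretrained Faster R-CNN; in the label map used by the torchvision detection models, "person" is class id 1, and photo.jpg plus the 0.7 score threshold are placeholders:
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# COCO-pretrained detector (the `pretrained` flag is the older torchvision API)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = to_tensor(Image.open("photo.jpg").convert("RGB"))
with torch.no_grad():
    pred = model([image])[0]

# Keep confident detections of the "person" class (COCO id 1)
persons = [s.item() for l, s in zip(pred["labels"], pred["scores"])
           if l.item() == 1 and s.item() > 0.7]
print("human present:", len(persons) > 0)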
You may also find these useful:
Detecting people using Yolo-OpenCV
Yolo object detection in pytorch
Another Yolo implementation in Pytorch
Similar question on ai.stackexchange
You can also utilize frameworks such as Detectron2 or mmdetection for these tasks (or TensorFlow's Object Detection API, etc.).

Resolution preserving Fully Convolutional Network

I am new to ML and Pytorch and I have the following problem:
I am looking for a Fully Convolutional Network architecture in Pytorch, so that the input would be an RGB image (HxWxC or 480x640x3) and the output would be a single-channel image (HxW or 480x640). In other words, I am looking for a network that will preserve the resolution of the input (HxW) and will lose the channel dimension. All of the networks that I've come across (ResNet, Densenet, ...) end with a fully connected layer (without any upsampling or deconvolution). This is problematic for two reasons:
I am restricted with the choice of the input size (HxWxC).
It has nothing to do with the output that I expect to get (a single channel image HxW).
What am I missing? Why is there even an FC layer? Why is there no up-sampling or deconvolution after feature extraction? Is there any built-in torchvision.model that might suit my requirements? Where can I find such a pytorch architecture? As I said, I am new to this field, so I don't really like the idea of building such a network from scratch.
Thanks.
You probably came across networks that are used for classification, so they end with pooling and a fully connected layer to produce a fixed number of categorical outputs.
Have a look at Unet
https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/
Note: the original unet implementation uses a lot of tricks.
You can simply downsample and then upsample symmetrically to do the work; a minimal sketch follows.
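A toy illustration of that idea (not the original U-Net, and without skip connections): two strided convolutions down, two transposed convolutions back up, ending in one channel, so a 3x480x640 input comes out as 480x640:
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # H/2 x W/2
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # H/4 x W/4
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # H/2 x W/2
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),              # H x W
        )

    def forward(self, x):
        # No fully connected layer, so any H and W divisible by 4 work
        return self.decoder(self.encoder(x)).squeeze(1)

out = TinyFCN()(torch.randn(1, 3, 480, 640))
print(out.shape)  # torch.Size([1, 480, 640])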
Your kind of task belongs to dense classification tasks, e.g. segmentation. For those tasks we use fully convolutional networks (see here for the original paper). In FCNs you don't have any fully-connected layers, because applying fully-connected layers loses the spatial information you need for the dense prediction. Also have a look at the U-Net paper. All state-of-the-art architectures use some kind of encoder-decoder architecture, extended for example with a pyramid pooling module.
There are some implementations in the pytorch model zoo here. Search also Github for pytorch implementations for other networks.

Where can one configure the CNN used by the Spacy TextCategorizer?

According to the comment at the top of the TextCategorizer,
Train a convolutional neural network text classifier on the IMDB
dataset, using the TextCategorizer component. The dataset will be
loaded automatically via Thinc's built-in dataset loader. The model is
added to spacy.pipeline, and predictions are available via doc.cats.
For more details, see the documentation:
* Training: https://spacy.io/usage/training
Where is the code for the CNN? Can the CNN be configured? Is there a research paper the implementation is based on?
The network architecture is defined in spaCy's _ml module, specifically in the build_text_classifier function.
The code related to training is in the pipeline module, specifically in the TextCategorizer class.
Some parameters like the dropout, batch size, and number of epochs can be configured as shown in the example. You can also modify the architecture of the network, but for that you have to know the framework behind spaCy, which is called Thinc (https://github.com/explosion/thinc), plus some Cython.
I don't know of any paper describing the model, but this video provides a great description of it: https://www.youtube.com/watch?v=sqDHBH9IjRU
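For what it's worth, in the spaCy v2 API this question refers to, you can also pick among the built-in text-classification architectures ("ensemble", the default CNN/BoW hybrid; "simple_cnn"; "bow") through the pipe config, without touching Thinc. A sketch; the label names are placeholders:
import spacy

nlp = spacy.blank("en")
# The "architecture" key selects the built-in model variant (spaCy v2)
textcat = nlp.create_pipe(
    "textcat",
    config={"exclusive_classes": True, "architecture": "simple_cnn"},
)
nlp.add_pipe(textcat)
textcat.add_label("POSITIVE")  # placeholder labels
textcat.add_label("NEGATIVE")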
