ONNX Runtime and CWaitCursor on Windows - visual-c++

I'm running a lengthy algorithm on Windows 10, written in MS Visual C++. A portion of the algorithm runs inference on an ONNX model using ONNX Runtime (ORT). I want to spin the wait cursor so the user knows the algorithm is still running and not yet done. It spins except during the part of the algorithm where ORT is running inference on the ONNX model. I cannot figure out how to keep the wait cursor spinning during an ORT inference call. Got an idea? Thanks.

Related

How to convert an ML project from a GPU project to a CPU project?

I am learning ML and I want to retrain an AI model for lane detection.
I want to become familiar with the ML training process. Accuracy is not my primary goal, and I do not need the best ML model for lane detection.
I found this AI model and want to try it out, but I have been facing a problem:
I do not have a GPU, so I wish to train this model with my CPU. Sadly, some parts of the code are written with CUDA. Is there a way I can convert this GPU code to CPU-only code?
Or should I find another AI model intended only for CPU training?
You can use the tensor.to(device) call to move a tensor to a device.
The .to() call is also used to move a whole model to a device, as in the post you linked to.
Another possibility is to set the device of a tensor at creation time using the device= keyword argument, as in t = torch.tensor(some_list, device=device).
To set the device dynamically in your code, you can use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
to set cuda as your device if possible.
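Putting those pieces together, here is a minimal sketch of the pattern (the model and data below are placeholders, not taken from the linked repository):
import torch
import torch.nn as nn

# Pick the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2)                     # placeholder model
model.to(device)                             # move all model parameters to the chosen device

batch = torch.randn(4, 10, device=device)    # create the input directly on that device
output = model(batch)                        # runs on the GPU if present, otherwise on the CPU
Written this way, the same script runs unchanged on a CPU-only machine.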
The above explains how CUDA is added to the code. So use Ctrl+F and remove all the keywords that force the code to run on the GPU, such as "device" and ".to()".

model.to(device) for PyTorch Lightning

I currently train my model on GPUs using PyTorch Lightning:
trainer = pl.Trainer(gpus=[0, 1],
                     distributed_backend='ddp',
                     resume_from_checkpoint=hparams["resume_from_checkpoint"])
trainer.fit(model, train_dataloader=train_loader, val_dataloaders=val_loader)
The instructions are also clear on how to run test samples with a trainer defined to use the GPU:
trainer.test(test_dataloader=test_dataloader)
and also how to load a model and use it interactively
model = transformer.Model.load_from_checkpoint('/checkpoints/run_300_epoch_217.ckpt')
results = model(in_data)
I use the latter to interface with an interactive system via sockets in a Docker container.
Is there a proper way to make this PyTorch Lightning model run on the GPU?
The Lightning instructions say not to use model.to(device), but it appears to work just as it does in plain PyTorch. Is the instruction there to avoid some side effect?
I started reading about ONNX, but I would rather just have an easy way to specify the GPU, since the interactive setup works perfectly on the CPU.
My understanding is that "Remove any .cuda() or .to(device) calls" only applies when using the Lightning trainer, because the trainer handles that itself.
If you don't use the trainer, a LightningModule is basically just a regular PyTorch model with some naming conventions, so calling model.to(device) is how you run it on the GPU.
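As a rough sketch of that outside the trainer (LitModel below is a placeholder module standing in for the asker's transformer.Model, not a real API):
import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):          # placeholder LightningModule
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 2)

    def forward(self, x):
        return self.layer(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LitModel()                           # or Model.load_from_checkpoint(...) as in the question
model.to(device)                             # outside the trainer, this behaves like any nn.Module
model.eval()

with torch.no_grad():
    in_data = torch.randn(1, 10, device=device)   # inputs must live on the same device as the model
    results = model(in_data)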

Why use Caffe2 or Core-ML instead of LibTorch(.pt file) on iOS?

It seems like there are several ways to run PyTorch models on iOS.
PyTorch(.pt) -> onnx -> caffe2
PyTorch(.pt) -> onnx -> Core-ML (.mlmodel)
PyTorch(.pt) -> LibTorch (.pt)
PyTorch Mobile?
What is the difference between the above methods?
Why do people use Caffe2 or Core ML (.mlmodel), which require a model format conversion, instead of LibTorch?
Core ML can use the Apple Neural Engine (ANE), which is much faster than running the model on the CPU or GPU. If a device has no ANE, Core ML can automatically fall back to the GPU or CPU.
I haven't really looked into PyTorch Mobile in detail, but I think it currently only runs on the CPU, not on the GPU. And it definitely won't run on the ANE because only Core ML can do that.
Converting models can be a hassle, especially from PyTorch which requires going through ONNX first. But you do end up with a much faster way to run those models.
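As a sketch of the export step mentioned above (the model here is a stock torchvision network used only as a placeholder; the final Core ML conversion depends on which converter and version you use, so it is only outlined in comments):
import torch
import torchvision

# Placeholder model; in practice this would be your trained network.
model = torchvision.models.mobilenet_v2(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX, the intermediate format mentioned above.
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)

# From here, a converter such as the onnx-coreml package (assumption: the exact
# API depends on your tooling versions) would produce the .mlmodel, roughly:
#   from onnx_coreml import convert
#   mlmodel = convert("model.onnx")
#   mlmodel.save("model.mlmodel")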

Object detection slow and does not use GPU

I need to use the TensorFlow Object Detection API to do some classification connected with recognition.
My problem is that using the API for detection with a pretrained COCO model takes too much time and definitely does not use the GPU. I checked my tensorflow-gpu installation with different scripts and it works fine, but when I use this model for detection I only see an increase in CPU usage.
I checked different versions of TensorFlow (1.12, 1.14) and different combinations of CUDA Toolkit (9.0, 10.0) and cuDNN (7.4.2, 7.5.1, 7.6.1), but it is all the same. I also tried it on both Windows 7 and Ubuntu 16.04 with no difference. My project, however, requires much faster detection times.
System information:
System: Windows 7, Ubuntu 16.04
Tensorflow: 1.12, 1.14
GPU: GTX 970
Run the following Python code; if it detects the GPU then you can use the GPU for training, otherwise there is some problem:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
One more thing: just because your CPU is being utilized does not mean the GPU is not at work. The CPU will always be busy, but the GPU should also spike when you are training.
Paste the output of the above code in a comment if you are not sure about it.
Edit: After a chat with the OP in the comments, I see that the code in question uses a pretrained model, so no training is happening here. You are using the model, not training a new one, so no GPU is being used.
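For completeness, one way to see where the ops in a session actually land (TF 1.x, matching the versions above; the tiny graph below is just a placeholder, not the detection code) is to enable device-placement logging:
import tensorflow as tf

config = tf.ConfigProto(log_device_placement=True)   # print which device each op runs on
config.gpu_options.allow_growth = True               # avoid grabbing all GPU memory up front

with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0])
    b = tf.constant([3.0, 4.0])
    print(sess.run(a + b))   # the log shows whether ops were placed on GPU:0 or CPU:0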

Gensim build_vocab taking too long

I'm trying to train a doc2vec model using the gensim library on 50 million sentences of variable length.
Some tutorials (eg. https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb) have a model.build_vocab step before the actual training process. This part has been running for 3 hours now without any updates.
Is this step necessary for the training process? Why could this step be taking so long since it's just a linear pass over the data?
Using gensim version 3.4.0 with Python 3.6.0.
The build_vocab() step, which discovers all words and then sets up the known-vocabulary structures, is required. (Though, if you supply your corpus as an argument to Doc2Vec, both build_vocab() and train() will be done automatically.)
You should enable Python logging at the INFO level to see logged information about the progress of this, and other long-running gensim steps. This will help you see if progress is truly being made, or has stopped or slowed at some point.
If the vocabulary-discovery starts fast but then slows, perhaps your system has too little memory and has begun using very-slow virtual memory (swapping). If it seems to stop, perhaps there's a silent error in your method of reading the corpus. If it's just slow the whole way, perhaps there's something wrong with your method of reading the corpus.
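A minimal sketch of enabling that logging around build_vocab() (the corpus here is a tiny placeholder, not 50 million sentences, and the parameter values are illustrative):
import logging
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# INFO-level logging makes gensim report progress during vocabulary discovery and training.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

corpus = [TaggedDocument(words=['some', 'words', str(i)], tags=[i]) for i in range(100)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=10)
model.build_vocab(corpus)    # one pass over the corpus to collect the vocabulary
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
The progress lines in the log will show whether vocabulary discovery is still advancing or has stalled.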
