I am learning ML and I want to retrain an AI model for lane detection.
I want to become familiar with the ML training process. Accuracy is not my primary goal, and I do not need the best ML model for lane detection.
I found this AI model and want to try it out, but I have been facing a problem:
I do not have a GPU, so I would like to train this model on my CPU. Sadly, some parts of the code are written for CUDA. Is there a way I can convert this GPU code to CPU-only code?
Or should I find another AI model that supports CPU-only training?
You can use the tensor.to(device) command to move a tensor to a device.
The .to() command is also used to move a whole model to a device, like in the post you linked to.
Another possibility is to set the device of a tensor during creation using the device= keyword argument, like in t = torch.tensor(some_list, device=device)
To set the device dynamically in your code, you can use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
to set cuda as your device if possible.
The snippet above is the usual way device selection is added to code. So use Ctrl+F to find every place that forces the code onto the GPU, such as hard-coded .cuda() calls or device="cuda" arguments, and replace them with the device-agnostic pattern above.
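For example, a minimal device-agnostic sketch (the model and tensor below are placeholders, not taken from the project you linked) would look like this:

import torch
import torch.nn as nn

# Pick the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model standing in for the lane-detection network.
model = nn.Linear(10, 2).to(device)

# Create the input directly on the chosen device instead of calling .cuda().
x = torch.randn(4, 10, device=device)

output = model(x)
print(output.device)  # prints "cpu" on a machine without CUDA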
I built my own model with the Keras premade models in TensorFlow Lattice using Python 3.7 and saved the trained model. However, when I use the trained model for prediction, the speed of predicting each data point is at the millisecond level, which seems very slow. Is there any way to speed up the prediction process for TFL?
There are multiple ways to improve speed, but they may involve a tradeoff with prediction accuracy. I think the three most promising options are:
Reduce the number of features
Reduce the number of lattices per feature
Use an ensemble of lattice models, where every lattice model only gets a subset of the features, and then average the predictions of the different models (as described here); a sketch of this idea follows the list.
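A minimal sketch of the ensemble idea, using small plain Keras models as stand-ins for the calibrated lattice models (the feature subsets, data, and model builder are made up for illustration):

import numpy as np
import tensorflow as tf

def build_small_model(n_features):
    # Stand-in for a calibrated lattice model built on a feature subset.
    return tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(n_features,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

# Hypothetical feature subsets: each sub-model only sees some columns.
feature_subsets = [[0, 1, 2], [2, 3, 4], [0, 4, 5]]

X = np.random.rand(1000, 6).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

models = []
for cols in feature_subsets:
    m = build_small_model(len(cols))
    m.compile(optimizer="adam", loss="mse")
    m.fit(X[:, cols], y, epochs=1, verbose=0)
    models.append(m)

# Average the predictions of the per-subset models.
preds = np.mean(
    [m.predict(X[:, cols], verbose=0) for m, cols in zip(models, feature_subsets)],
    axis=0,
)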
As the lattice model is a standard Keras model, I recommend trying OpenVINO. It optimizes your model by converting it to the Intermediate Representation (IR), performing graph pruning, and fusing some operations into others while preserving accuracy. It then uses vectorization at runtime. OpenVINO is optimized for Intel hardware, but it should work with any CPU.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
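For example, the FP16 variant of the same command would be (the output directory name here is just an illustration):
mo --saved_model_dir "model" --data_type FP16 --output_dir "model_ir_fp16"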
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. If you care about latency, I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement. If you care about throughput, change the value to THROUGHPUT or CUMULATIVE_THROUGHPUT.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"LATENCY"})
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
I am trying to run my PyTorch model for ASR on an ARM-based device without a GPU. As far as I know, ARM does not support MKL, which ATen uses. Naturally, I get the following error when I try to run inference:
RuntimeError: fft: ATen not compiled with MKL support
How can I solve this problem? Are there any alternatives that I can use?
If your target device is mobile, it's reasonable to try converting the model to TorchScript with PyTorch Mobile first. TorchScript is an intermediate representation of a PyTorch model that can then be run in a mobile environment; a minimal conversion sketch follows the link below.
https://pytorch.org/mobile/home/
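A rough conversion sketch (the network here is a placeholder, not your ASR model, and ops like torch.stft may still need special handling):

import torch
import torch.nn as nn
from torch.utils.mobile_optimizer import optimize_for_mobile

# Placeholder network standing in for the ASR model.
model = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 29))
model.eval()

# Trace the model with a representative example input.
example_input = torch.randn(1, 80)
traced = torch.jit.trace(model, example_input)

# Apply mobile-specific optimizations and save for the lite interpreter.
optimized = optimize_for_mobile(traced)
optimized._save_for_lite_interpreter("asr_model.ptl")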
I solved this issue by bypassing PyTorch's stft implementation. This may not be feasible for everyone, but in my case it allowed me to make predictions using my model with no issues on the ARM device.
The problem stemmed from the _VF.stft call in packages/torch/functional.py.
I changed the line
return _VF.stft(input, n_fft, hop_length, win_length, window, normalized, onesided, return_complex)
with:
import librosa
import numpy as np
import torch

# Compute the STFT on the CPU with librosa instead of _VF.stft (which needs MKL).
librosa_stft = librosa.stft(input.cpu().detach().numpy().reshape(-1), n_fft=n_fft, hop_length=hop_length, win_length=win_length, window="hann", center=True, pad_mode=pad_mode)
# Split the complex result into real/imaginary parts to match PyTorch's layout.
librosa_stft = np.array([[a.real, a.imag] for a in librosa_stft])
librosa_stft = np.transpose(librosa_stft, axes=[0, 2, 1])
librosa_stft = np.expand_dims(librosa_stft, 0)
librosa_stft = torch.from_numpy(librosa_stft)
return librosa_stft
This code may be optimized further; I just tried to replicate what PyTorch does by using librosa. The resulting output is the same in both versions in my case, but you should check your own outputs to be sure if you decide to use this method.
I currently train my model on GPUs using PyTorch Lightning:
trainer = pl.Trainer(gpus=[0, 1],
                     distributed_backend='ddp',
                     resume_from_checkpoint=hparams["resume_from_checkpoint"])
trainer.fit(model, train_dataloader=train_loader, val_dataloaders=val_loader)
The instructions are also clear on how to run test samples with a trainer defined to use the GPU:
trainer.test(test_dataloader=test_dataloader)
and also how to load a model and use it interactively
model = transformer.Model.load_from_checkpoint('/checkpoints/run_300_epoch_217.ckpt')
results = model(in_data,
I use the latter to interface with an interactive system via sockets in a Docker container.
Is there a proper way to make this PyTorch Lightning model run on the GPU?
The Lightning instructions say not to use model.to(device), but it appears to work just like plain PyTorch. Is the reason for that instruction to avoid some side effect?
I started reading about ONNX, but I would rather just have an easy way to specify the GPU, since the interactive setup already works perfectly on the CPU.
My understanding is that "Remove any .cuda() or .to(device) calls" only applies when you use the Lightning trainer, because the trainer handles that itself.
If you don't use the trainer, a LightningModule is basically just a regular PyTorch model with some naming conventions, so using model.to(device) is how to run it on the GPU.
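For example, something along these lines should work for interactive GPU inference (the class name, checkpoint path, and in_data are taken from your snippet; everything else is a generic sketch):

import torch

# Load the trained LightningModule from its checkpoint.
model = transformer.Model.load_from_checkpoint('/checkpoints/run_300_epoch_217.ckpt')
model.eval()

# Move it to the GPU exactly as you would a plain PyTorch model.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

with torch.no_grad():
    # Inputs must live on the same device as the model.
    results = model(in_data.to(device))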
It seems like there are several ways to run PyTorch models on iOS:
PyTorch(.pt) -> onnx -> caffe2
PyTorch(.pt) -> onnx -> Core-ML (.mlmodel)
PyTorch(.pt) -> LibTorch (.pt)
PyTorch Mobile?
What is the difference between the above methods?
Why do people use Caffe2 or Core ML (.mlmodel), which require a model format conversion, instead of LibTorch?
Core ML can use the Apple Neural Engine (ANE), which is much faster than running the model on the CPU or GPU. If a device has no ANE, Core ML can automatically fall back to the GPU or CPU.
I haven't really looked into PyTorch Mobile in detail, but I think it currently only runs on the CPU, not on the GPU. And it definitely won't run on the ANE because only Core ML can do that.
Converting models can be a hassle, especially from PyTorch which requires going through ONNX first. But you do end up with a much faster way to run those models.
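As a starting point for the PyTorch -> ONNX -> Core ML route, the ONNX export step might look like this (the network and input shape are placeholders for your own model):

import torch
import torch.nn as nn

# Placeholder model standing in for your trained network.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
model.eval()

# Export to ONNX with a dummy input of the expected shape.
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=11)

# The resulting model.onnx can then be converted to a Core ML .mlmodel with a separate converter.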
We have CNN-based object detection models such as Fast R-CNN, Faster R-CNN, YOLO (You Only Look Once), and SSD (Single Shot Detector).
I have tried running Faster R-CNN using Caffe, but the backward pass is not implemented for CPU mode. Is there any CNN-based model that I can train using only the CPU?
Any help will be appreciated.
Faster R-CNN layers on CPU: https://github.com/neuleaf/faster-rcnn-cpu
SSD's original implementation already supports CPU training.
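For Caffe-based models like SSD, CPU-only training mostly comes down to selecting CPU mode before solving. A minimal pycaffe sketch, assuming a solver.prototxt (hypothetical path below) whose layers all have CPU implementations:

import caffe

# Force Caffe to run everything on the CPU.
caffe.set_mode_cpu()

# Load the solver definition and train.
solver = caffe.SGDSolver('models/ssd/solver.prototxt')
solver.solve()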