I would like to load a Hugging Face base model (xlm-roberta-base) from Google Cloud Storage without downloading it to my local directory, if possible. My model is a PyTorch model which builds on the Hugging Face base model. How can I go about this?
I can download blobs with a particular prefix from my bucket, but I would like to do it without actually persisting the downloaded blobs in my local directory.
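One possible approach (just a rough sketch; the bucket and blob names below are placeholders, and strict=False is used because the checkpoint's key names may not line up exactly with the bare XLMRobertaModel) is to stream the config and weights into memory with the google-cloud-storage client and rebuild the model from them:

import io
import json

import torch
from google.cloud import storage
from transformers import XLMRobertaConfig, XLMRobertaModel

# Hypothetical bucket and blob names -- adjust to wherever the files were copied.
client = storage.Client()
bucket = client.bucket("my-model-bucket")

# Pull the files into memory instead of writing them to the local filesystem.
config_bytes = bucket.blob("xlm-roberta-base/config.json").download_as_bytes()
weights_bytes = bucket.blob("xlm-roberta-base/pytorch_model.bin").download_as_bytes()

# Rebuild the base model from the in-memory config and state dict.
config = XLMRobertaConfig.from_dict(json.loads(config_bytes))
model = XLMRobertaModel(config)
state_dict = torch.load(io.BytesIO(weights_bytes), map_location="cpu")
model.load_state_dict(state_dict, strict=False)  # strict=False: checkpoint keys may carry extra prefixes or heads

Anything that must go through from_pretrained itself still expects files on disk, so a tempfile.TemporaryDirectory() that is cleaned up right after loading is a common compromise, e.g. for the tokenizer files.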
We are trying to use the container preview of Form Recognizer, OCR, and the labeling tool, and have the following questions:
Is there any software which can help us classify similar kinds of documents? This would help us categorize documents and create a training dataset.
Is there any way to give the model a user-defined name? Following is the output from the model query API; it is difficult to tie it back to the different kinds of models:
{
  "modelId": "f136f65b-bb94-493b-a798-a3e8023ea1b5",
  "status": "ready",
  "createdDateTime": "2020-05-06T21:35:58+00:00",
  "lastUpdatedDateTime": "2020-05-06T21:36:06+00:00"
}
I can see model files stored in \output\subscriptions\global\models, where /output is the directory shared with the container in the docker compose file. Is it possible to import this model into new containers?
Models have a json and a gz file with the same name as the model ID.
I am also attaching the docker compose file for your reference.
Is there a way to fine-tune or update the same custom model (same model ID) with training data?
We were also trying the labeling tool, but it only takes Azure blob storage as input. Is it possible to provide input the same way we do for training Form Recognizer?
We are struggling to get this set up, and if it is not resolved we might have to start looking at alternatives.
Following are answers to your questions:
To classify documents you can use Custom Vision to build a document classifier, or use text classification and OCR. In addition, you can use Form Recognizer train without labels, run it on the training data, and use the cluster option within the model to classify similar documents and pages in the training dataset.
A friendly model name is not yet available in Form Recognizer; it's a future feature on our roadmap (a local workaround is sketched after these answers).
Models can't be copied between containers, but you can use the same dataset to train a model in a different container. Models can be copied between subscriptions, resources, and regions when using the Form Recognizer cloud service.
Each training run creates a new model ID so that the previous model is not overwritten; you can't update existing models.
The Form Recognizer v2.0 release is not yet available in containers; only the v1.0 release is. Form Recognizer v2.0 will also be available in containers shortly. When using the container release, all the data remains on premises, and the labeling tool, once available for the v2.0 container release, will also take a local or mounted disk as input rather than blob storage.
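Since a friendly name isn't exposed by the service yet, a local workaround (just a sketch; the model_names.json file is a made-up convention) is to keep your own modelId-to-name mapping and look each model up against the model query API response, whose shape is shown above:

import json

# Hypothetical local file mapping model IDs to human-readable names,
# e.g. {"f136f65b-bb94-493b-a798-a3e8023ea1b5": "invoices-v1"}
with open("model_names.json") as f:
    friendly_names = json.load(f)

# model_info is the JSON returned by the model query API (as in the example above).
model_info = {
    "modelId": "f136f65b-bb94-493b-a798-a3e8023ea1b5",
    "status": "ready",
    "createdDateTime": "2020-05-06T21:35:58+00:00",
    "lastUpdatedDateTime": "2020-05-06T21:36:06+00:00",
}

# Fall back to a placeholder when the ID has no recorded name.
name = friendly_names.get(model_info["modelId"], "<unnamed>")
print(f"{name}: {model_info['modelId']} ({model_info['status']})")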
Thanks !
Neta - MSFT
I am using Azure Cognitive Services, aka the Custom Vision website, to create, train, and test models. I understand the main goal of this site is to create an API which can be called to run your model in production. I should mention I am using this to do object detection.
There are times when you have to support running offline (meaning you don't have a connection to Azure, etc...). I believe Microsoft knows and understands this because they have a feature which allows you to export your model in many different formats (such as TensorFlow, ONNX, etc...).
The issue I am having is that when you export to TensorFlow, which is what I need, it will only download the frozen model graph (model.pb). However, there are times when you need either the .pbtxt file that goes along with the model or the config file. I know you can generate a pbtxt file, but for that you need the .config.
Also, there is little to no information about your model once you export it, such as what the input image size should be. I would like to see this better documented somewhere. For example, is it 300x300, etc.? Without getting the config or pbtxt along with the model, you have to figure this out by loading your model into TensorBoard or something similar to work out the input information (size, name, etc.). Furthermore, we don't even know what the baseline of the model is: is it ResNet, SSD, etc.?
So, anybody know how I can get these missing files when I export a model? Or, anybody know how you can generate a pbtxt when all you have is the frozen graph .pb file?
If not, I would recommend these as improvements for the Azure Cognitive services team. With all of this missing data or information, it is really hard to consume the exported model.
Thanks!
Many model architectures, such as YOLO, which is the architecture exported from Custom Vision, allow you to change the network input size. Including a fixed input size somewhere does not make sense in this case.
Netron will be your good friend; it is pretty easy to use to figure out the details of the model.
Custom Vision Service only exports compact domains. For object detection exports, the downloaded zip file (model.pb, labels.txt) includes code to load and run the object detection model. Along with the exported model you will find Python code to exercise the model.
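On the specific question of generating a .pbtxt when all you have is the frozen graph, here is a minimal sketch (assuming model.pb is a standard frozen GraphDef; this recovers the graph in text form and the placeholder shapes, but not the original training pipeline config):

import tensorflow as tf

# Parse the frozen graph exported by Custom Vision (binary GraphDef protobuf).
graph_def = tf.compat.v1.GraphDef()
with tf.io.gfile.GFile("model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

# Write the same graph back out in text form as model.pbtxt.
tf.io.write_graph(graph_def, ".", "model.pbtxt", as_text=True)

# List placeholder (input) nodes to discover the expected input tensor name and shape.
for node in graph_def.node:
    if node.op == "Placeholder":
        print(node.name, node.attr["shape"])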
I am using LightFM for a recommender system:
https://github.com/lyst/lightfm
Now I want to move my model to AWS SageMaker, where this is not one of the built-in algorithms. I want to train my model using this algorithm and also leverage SageMaker's capabilities for huge data. I am following this link to run my custom model:
https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html
Now, it seems I need to build a docker image of this algorithm and do many other things too. Is there any simple way to train my model without a pre-built algorithm?
You'll have to put the algorithm into a docker container and bring it to SageMaker for training. You may want to check out SageMaker sample notebooks to get some examples of preparing the docker images. For example, https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/r_bring_your_own
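As a rough sketch of what launching such a bring-your-own-container training job can look like with the SageMaker Python SDK (v2 parameter names; the ECR image URI, IAM role, and S3 paths below are placeholders you would replace):

from sagemaker.estimator import Estimator

# Placeholders -- replace with your own image, role, and S3 locations.
estimator = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/lightfm-train:latest",
    role="arn:aws:iam::<account>:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/lightfm/output",
)

# SageMaker mounts this channel under /opt/ml/input/data/train inside the container;
# the container's entry point is expected to read it, fit the LightFM model, and
# write the resulting artifact to /opt/ml/model.
estimator.fit({"train": "s3://my-bucket/lightfm/train"})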
We are trying to figure out how to host and run many of our existing scikit-learn and R models (as is) in GCP. It seems ML Engine is pretty specific to TensorFlow. How can I train a scikit-learn model on Google Cloud Platform and manage my model if the dataset is too large to pull into Datalab? Can I still use ML Engine, or is there a different approach most people take?
As an update, I was able to get the Python script that trains the scikit-learn model to run by submitting it as a training job to ML Engine, but I haven't found a way to host the pickled model or use it for prediction.
Cloud ML Engine only supports models written in TensorFlow.
If you're using scikit-learn you might want to look at some of the higher level TensorFlow libraries like TF Learn or Keras. They might help migrate your model to TensorFlow in which case you could then use Cloud ML Engine.
It's possible; Cloud ML has had this feature since Dec 2017. As of today it is provided as early access. Basically, the Cloud ML team is testing this feature, but you can also be part of it. More here.
Use the following command to deploy your scikit-learn models to Cloud ML. Please note these parameters may change in the future.
gcloud ml-engine versions create ${MODEL_VERSION} --model=${MODEL} --origin="gs://${MODEL_PATH_IN_BUCKET}" --runtime-version="1.2" --framework="SCIKIT_LEARN"
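Once the version is created, you can call online prediction from Python via the ML Engine REST API. A minimal sketch (the project, model, and version names are placeholders, and the instance format must match what your scikit-learn model's predict expects):

from googleapiclient import discovery

# Placeholders -- fill in your own project, model, and version names.
name = "projects/{}/models/{}/versions/{}".format("my-project", "my_sklearn_model", "v1")

service = discovery.build("ml", "v1")
response = service.projects().predict(
    name=name,
    body={"instances": [[5.1, 3.5, 1.4, 0.2]]},  # one row of features per instance
).execute()

if "error" in response:
    raise RuntimeError(response["error"])
print(response["predictions"])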
sklearn is now supported on ML Engine.
Here is a fully worked out example of using fully-managed scikit-learn training, online prediction and hyperparameter tuning:
https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/blogs/sklearn/babyweight_skl.ipynb
I know that I can either upload my data files to Azure ML (as new datasets) or use blobs (and read the data within the ML experiment). I wonder if one of them in particular is recommended when training machine learning models and creating prediction-related ML solutions.
My goal in using Azure is to cluster users based on a variety of features. I have a large dataset (~50 GB). I wonder if you have any recommendations.
I appreciate any help!
As stated at Azure Machine Learning Frequently Asked Questions: "For datasets larger than a few GB, you should upload data to Azure Storage or Azure SQL Database or use HDInsight, rather than directly uploading from local file."
Also please note the maximum sizes of datasets for modules in the Machine Learning Studio. These limits are listed as a part of the same FAQ linked above.
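For reference, here is a minimal sketch of uploading a large local file to Azure Blob Storage with the azure-storage-blob Python SDK (the connection string, container, and file names are placeholders), after which it can be read from your ML experiment:

from azure.storage.blob import BlobServiceClient

# Placeholders -- use your own storage account connection string and container.
service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
blob = service.get_blob_client(container="training-data", blob="users_features.csv")

# Stream the large file up from disk rather than loading it all into memory.
with open("users_features.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)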