I want to train a PyTorch-based model using AWS SageMaker, but I don't know how to use the ML training services. Is it possible to simply train the model on the notebook instance and upload the model to an S3 bucket? Also, how would I ensure that I do not end up paying more than I need to?
Related
I have developed a SNIPS NLU (or similar) model. I am trying to deploy the model on Spark clusters using PySpark, but I am not sure how to do it. Any help?
Is it possible to train a Spark/PySpark MLlib model using Vertex AI custom container model building? I couldn't find any reference to Spark model training in the Vertex AI documentation. For distributed model building, the only options available are PyTorch or TensorFlow.
It is possible with custom containers if you leverage the Spark Kubernetes operator, but this is not a well-documented workflow and will require complex setup. GCP's preferred way to run Spark jobs is on Dataproc (https://cloud.google.com/dataproc), which supports PySpark, SparkR, and Scala. You can still trigger a Dataproc Spark job from Vertex Pipelines and save the model for predictions in Vertex via MLeap.
I've been using Amazon SageMaker notebooks to build a PyTorch model for an NLP task.
I know you can use SageMaker for training, deployment, hyperparameter tuning, and model monitoring.
However, it looks like you have to create an inference endpoint in order to monitor the model's inference performance.
I already have an EC2 instance set up to perform inference tasks on our model, which is currently on a development box, and I would rather not use an endpoint to make
Is it possible to use SageMaker to train, run hyperparameter tuning, and evaluate the model without creating an endpoint?
If you don't want to keep an inference endpoint up, one option is to use SageMaker Processing to run a job that takes your trained model and test dataset as input, performs inference and computes evaluation metrics, and saves them to S3 in a JSON file.
This Jupyter notebook example steps through (1) preprocessing training and test data, (2) training a model, and (3) evaluating the model.
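As a rough illustration of the evaluation step described above, here is a minimal sketch of the kind of script a Processing job might run once predictions have been produced. It assumes binary 0/1 labels; the function names, the metric set, and the file name evaluation.json are hypothetical and are not taken from the linked notebook. In a real Processing job the output directory would typically sit under /opt/ml/processing/, and its contents would be uploaded to S3 when the job finishes.

```python
import json
import os
import tempfile


def compute_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def write_report(metrics, output_dir, filename="evaluation.json"):
    """Write the metrics dict as JSON into output_dir and return the file path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
    return path


if __name__ == "__main__":
    # Toy labels and predictions; a real job would load the model and test set.
    metrics = compute_metrics([1, 0, 1, 1], [1, 0, 0, 1])
    with tempfile.TemporaryDirectory() as out_dir:
        print(write_report(metrics, out_dir))
```

Keeping the metric computation separate from the file writing makes it easy to test the script on the notebook instance before paying for a job run.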
You can deploy your model on AWS SageMaker in two ways: set up an endpoint or create a batch transform job. In your case, the latter is probably worth trying.
The good thing about using a batch transform job is that you can specify the S3 bucket path for both input and output data. When the job is completed, it will upload the output to the S3 path directly.
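To make the input and output paths concrete, here is a minimal sketch that builds the request for a batch transform job as a plain dictionary. The top-level keys follow the SageMaker CreateTransformJob API; the helper name, the default instance type, the content type, and the line-based split are assumptions chosen for illustration.

```python
def build_transform_request(job_name, model_name, input_s3_uri, output_s3_uri,
                            instance_type="ml.m5.large", instance_count=1,
                            content_type="text/csv"):
    """Build a CreateTransformJob request dict (hypothetical helper)."""
    for uri in (input_s3_uri, output_s3_uri):
        if not uri.startswith("s3://"):
            raise ValueError(f"expected an s3:// URI, got {uri!r}")
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,  # an existing SageMaker model is assumed
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": content_type,
            "SplitType": "Line",  # assumption: one record per line
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }


if __name__ == "__main__":
    # Bucket, job, and model names below are placeholders.
    request = build_transform_request(
        "nlp-eval-job", "my-pytorch-model",
        "s3://my-bucket/test-data/", "s3://my-bucket/predictions/",
    )
    print(request["TransformOutput"]["S3OutputPath"])
```

With boto3 installed and credentials configured, the dictionary could be passed as boto3.client("sagemaker").create_transform_job(**request). The instances exist only for the duration of the job, so nothing keeps running afterwards.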
I am using Kubeflow fairing to train a TensorFlow model on Kubernetes. The training succeeds but now I want to serve a prediction endpoint.
How can I retrieve the saved TensorFlow session (weights, biases etc.) from the training step so that I can do this? At the moment the result of the training step is saved inside the Docker container running on the Kubernetes cluster.
I had misunderstood the scope of Kubeflow fairing - at the time of writing it doesn't support copying the trained model from the fairing job to where the code was run from, nor is this necessarily desirable.
I instead used the Minio instance provisioned by Kubeflow to store and retrieve tarballs of trained models.
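A minimal sketch of the tarball part of that workflow, using only the standard library. The function names and directory layout are hypothetical, and the transfer to and from Minio is left out; with the minio Python client it could be done with fput_object and fget_object, given an endpoint, credentials, and a bucket.

```python
import os
import tarfile
import tempfile


def pack_model(model_dir, tar_path):
    """Write the contents of model_dir into a gzip-compressed tarball."""
    with tarfile.open(tar_path, "w:gz") as tar:
        # arcname="." stores paths relative to model_dir, not absolute paths.
        tar.add(model_dir, arcname=".")
    return tar_path


def unpack_model(tar_path, dest_dir):
    """Extract a tarball created by pack_model into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return dest_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work:
        model_dir = os.path.join(work, "export")
        os.makedirs(model_dir)
        # Stand-in for the files written by the training step.
        with open(os.path.join(model_dir, "checkpoint"), "w") as f:
            f.write("placeholder weights")
        tarball = pack_model(model_dir, os.path.join(work, "model.tar.gz"))
        restored = unpack_model(tarball, os.path.join(work, "restored"))
        print(sorted(os.listdir(restored)))
```

The training job would call pack_model and upload the result; the serving side would download the tarball and call unpack_model before loading the model.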
Machine Learning Model: Tensorflow Based (version 1.9) & Python version 3.6
Data Input: From Bigquery
Data Output: To Bigquery
Production prediction frequency: Monthly
I have developed a TensorFlow-based machine learning model. I have trained it locally and want to deploy it on Google Cloud Platform for predictions.
The model reads input data from Google BigQuery, and the output predictions have to be written to Google BigQuery. There are some data preparation scripts which have to be run before the model prediction is run. Currently I cannot use BigQuery ML in production as it is in the Beta stage. Additionally, as it is a batch prediction, I don't think TensorFlow Serving will be a good choice.
Strategies which I have tried for deployment:
Use Google ML Engine for prediction: This approach creates output part files on GCS. These have to be combined and written to Google BigQuery. So in this approach I have to spin up a VM just to execute the data preparation script and the script that writes the ML Engine output to Google BigQuery. This adds up to the 24x7 cost of a VM just for running two scripts once a month.
Use Dataflow for data preparation script execution along with Google ML Engine: Dataflow uses python 2.7 while the model is developed in Tensorflow version 1.9 and python version 3.6. So this approach cannot be used.
Google App Engine: Using this approach a complete web application has to be developed in order to serve predictions. As the predictions are in batch this approach is not suitable. Additionally flask/django has to be integrated with the code in order to use it.
Google Compute Engine: Using this approach the VM would be running 24x7 just for running monthly predictions and running two scripts. This would cause a lot of cost overhead.
I would like to know what the best deployment approach is for TensorFlow models that have some pre- and post-processing scripts.
Regarding option 3, Dataflow can read from BigQuery and store the prepared data in BigQuery at the end of the job.
Then you can have TensorFlow use BigQueryReader to read data from BigQuery.
Another option is Datalab, a notebook environment in which you can prepare your data and then use it for your prediction.
I've also not found this process flow easy or intuitive. There are two new updates which might help in your project:
BigQuery ML now allows you to import TensorFlow models (link). There are some limitations, but this may eliminate some of the back-and-forth data movement between BQ and Cloud Storage or other environments.
Cloud Dataflow supports Python 3 in alpha (check the Apache Beam roadmap, link).