I am new to TFX and I would like to know if it's possible to create a TFX pipeline that can train Keras models on TPU nodes using TPUStrategy from a TPUClusterResolver. Looking at the TFX documentation, it is not clear to me whether this is possible.
Any feedback about this would be much appreciated! Thank you!
Note: I'm using TF v2.
Yes, TFX pipelines support TPU out of the box for TPU-enabled Estimators; the same model_config.model can be used to configure an Estimator regardless of whether it is trained on CPU or TPU.
To train on TPU, take the following actions:
Configure a tpu_footprint in your deployment file and the desired trainer stanza of your service file.
Include an empty TPUSettings message in your custom_config file.
Optionally, configure TPUEstimator-specific args in model_config.tpu_estimator.
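For the TF 2 Keras path the question mentions, a rough sketch of resolving a TPU with TPUClusterResolver and training under TPUStrategy might look like the following. The TPU address, model, and data below are placeholders, and how this gets wired into a TFX Trainer depends on your pipeline.

```python
import tensorflow as tf

# Resolve and initialize the TPU; the address is a placeholder
# (on a Cloud TPU node this is the gRPC endpoint of the TPU worker).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://10.0.0.2:8470')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# In newer TF 2.x releases this is tf.distribute.TPUStrategy.
strategy = tf.distribute.experimental.TPUStrategy(resolver)

# Variables must be created inside the strategy scope so they land on the TPU.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

# Dummy in-memory data; a real pipeline would read TFRecords via tf.data.
train_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([128, 10]), tf.random.uniform([128, 1]))
).batch(64, drop_remainder=True)  # TPUs prefer fixed batch shapes

model.fit(train_ds, epochs=1)
```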
We have multiple Databricks workspaces on Azure. On one of them we trained multiple models and registered them in the MLflow registry. Our goal is to move those models from one Databricks workspace to another, and so far I could not find a straightforward way to do this except running the training script again on the new Databricks workspace.
Downloading the models and registering them in the new workspace didn't work so far. Should I create a "dummy" training script that just loads the model, does nothing with it, and then logs it away in the new workspace?
It seems to me like Databricks never anticipated that someone might want to migrate ML models?
The model registry caches model-related artifacts, so the shared model registry will have a copy of all these artifacts (MLmodel, conda.yaml, etc.) from the source workspace. The actual run is not cached on the shared registry; instead there is a pointer to the source run in the "run_link" field of a model version.
The mlflow-export-import tool uses the public MLflow API to do a best-effort migration. For OSS MLflow it works quite well. For Databricks MLflow the main limitation is that we cannot export notebook revisions associated with an MLflow run, since there is no API endpoint for this. For registered models, it can migrate the run associated with the model, subject to the above caveat.
https://github.com/amesar/mlflow-export-import#registered-models-1
You can use the MLflow APIs for that, for example the Python Tracking API, with get_registered_model, get_run, create_registered_model, etc. One of the Databricks solution architects developed a project for exporting and importing models/experiments/runs on top of the MLflow APIs.
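For illustration, a rough sketch of what a hand-rolled migration using those client APIs might look like; the profile names, model name, version, and artifact path are placeholders, and cross-workspace authentication/secrets handling is not shown.

```python
import mlflow
from mlflow.tracking import MlflowClient

# Placeholder URIs; in Databricks these map to per-workspace CLI profiles/tokens.
source_client = MlflowClient(tracking_uri="databricks://source-profile",
                             registry_uri="databricks://source-profile")
target_client = MlflowClient(tracking_uri="databricks://target-profile",
                             registry_uri="databricks://target-profile")

model_name = "my-model"   # hypothetical registered model name
version = "3"             # hypothetical version to migrate

# Download the version's artifacts (MLmodel, conda.yaml, weights, ...) locally.
# "model" is an assumed artifact path; it depends on how the model was logged.
mv = source_client.get_model_version(model_name, version)
local_path = source_client.download_artifacts(mv.run_id, "model", dst_path="/tmp/export")

# Re-register in the target workspace: log the artifacts into a fresh run
# so the target registry has something to point at.
mlflow.set_tracking_uri("databricks://target-profile")
mlflow.set_registry_uri("databricks://target-profile")
with mlflow.start_run() as run:
    mlflow.log_artifacts(local_path, artifact_path="model")
    target_client.create_registered_model(model_name)  # fails if it already exists
    target_client.create_model_version(
        name=model_name,
        source=f"{run.info.artifact_uri}/model",
        run_id=run.info.run_id,
    )
```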
You can also consider using a shared MLflow registry (sometimes called a central model registry): training happens in a specific workspace, but models are (optionally) logged and registered in a central place. In this model it could be easier to maintain permissions, do backups, etc.
Is there a similar approach in Azure to the one described in the GCP link below?
https://cloud.google.com/ai-platform/training/docs/custom-containers-training
Appreciate the assistance in advance.
It's not conventional, as Azure has different ways to approach this problem, but it can be done, and I've done it before using custom images.
Here is the doc on custom containers.
Every pipeline step occurs within containers. If you look inside your Azure resources you will see a container registry dedicated to your AML instance. You can even download them and work on them locally if you have access. You can also push your own custom containers.
When you use your container (which is assumed to have some code in it that takes an input and trains a model), you specify your execution with the InferenceConfig. You will be able to see the outputs and logs in the AML interface.
In line ~67 here you can see an example of using a custom environment in a pipeline script step. I'm using this to specify a runtime environment and then pushing my code into it at runtime, but you could easily modify this to create a custom training step that mimics what's going on in the Google doc.
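For reference, a minimal sketch (assuming the v1 azureml-core SDK) of pointing a training run at a custom image; the image name, compute target, and script paths are placeholders.

```python
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()  # assumes a config.json for your AML workspace

# Custom training image pushed to your own registry (placeholder name).
env = Environment(name="custom-train-env")
env.docker.base_image = "myregistry.azurecr.io/my-train-image:latest"
env.python.user_managed_dependencies = True  # use the Python baked into the image

src = ScriptRunConfig(
    source_directory="./src",      # placeholder: folder containing train.py
    script="train.py",
    compute_target="cpu-cluster",  # placeholder compute target name
    environment=env,
)

run = Experiment(ws, "custom-container-training").submit(src)
run.wait_for_completion(show_output=True)
```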
Depending on your problem there may be easier ways to train a custom model, but I feel that this answers the OP's specific question.
I am trying to train a CNN (ResNet50 for now) using Keras on Google Colab with their TPU support. The TPU VM on Colab has a small local disk size, so I cannot fit my training images on it.
I tried uploading the train/test images to Google Drive, but it appears to be rather slow to access the files from there on Colab. I set up a Google Cloud Storage (GCS) bucket to upload the data to, but I cannot find good examples of how to connect the bucket to Keras and the TPU for training.
On the TensorFlow website they suggest just using GCS as a filesystem. But there is something about file access having to go through "tf.io.gfile". What does that mean with regard to Keras?
The Shakespeare TPU example shows mounting a GCS bucket and using it for model storage, so that way I can mount and reference the bucket. But it does not show how to feed the training data from GCS. All the examples I find use some predefined set of images bundled with Keras.
Some instructions seem to state that the TPU runs on its own, separate server, and that the data should be on GCS for the TPU to access it. If I run a Keras generator, do image augmentation, and then feed these to the training system, does that not mean I am continuously downloading images over the network to the Colab VM, modifying them, and sending them over the network to the TPU server?
It all seems rather complicated to run a simple CNN model in Keras on a TPU. What am I missing here, and what is the correct process?
If anyone has a concrete example, that would be great.
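As a rough illustration of the "GCS as a filesystem" approach mentioned above: tf.data and tf.io.gfile accept gs:// paths directly, so something like the sketch below can stream TFRecords from the bucket. The bucket path and feature spec are placeholders, and the exact preprocessing depends on how the images were serialized.

```python
import tensorflow as tf

# Placeholder bucket and TFRecord pattern; tf.data (and tf.io.gfile)
# understand gs:// paths natively.
file_pattern = "gs://my-bucket/train/*.tfrecord"

feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    # Decode one serialized example into an (image, label) pair.
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_jpeg(parsed["image"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, parsed["label"]

dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob(file_pattern))
    .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(128, drop_remainder=True)   # fixed batch size for the TPU
    .prefetch(tf.data.experimental.AUTOTUNE)
)

# model.fit(dataset, ...) can then be called on a model built under TPUStrategy.
```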
Why do I need a container for AWS SageMaker? If I want to run scikit-learn in SageMaker's Jupyter notebook for self-learning purposes, do I still need to configure a container for it?
What is the minimum configuration I will need on SageMaker if I just want to learn scikit-learn? For example, I want to run scikit-learn's decision tree algorithm with a set of training data and a set of test data. What do I need to do on SageMaker to perform these tasks? Thanks.
You don't need much. Just an AWS account with the appropriate permissions on your role.
Inside the AWS SageMaker console you can launch a notebook instance with one click. Scikit-learn comes preinstalled, so you can use it out of the box. No special container is needed.
At a minimum, you just need your AWS account with the appropriate permissions to create EC2 instances and read/write from your S3 buckets. That's all, just try it. :)
Use this as a starting point: Amazon SageMaker – Accelerating Machine Learning
You can also access it via the Jupyter Terminal
If you are not concerned about using SageMaker's training and deployment features, then you just need to create a new conda_python3 notebook and import sklearn.
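For instance, the decision-tree scenario from the question runs as-is in such a notebook, using a built-in dataset as a stand-in for your own train/test data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Built-in dataset as a placeholder for your own training/test data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```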
I too was confused about how to take advantage of SageMaker's train/deploy features with scikit-learn. The best and most up-to-date explanation seems to be:
https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/sklearn/README.rst
The brief summary is:
You save your training data to an S3 bucket.
Create a standalone Python script that does your training, serializes the trained model to a file, and saves it to an S3 bucket.
In a notebook on SageMaker you import the SageMaker SDK and point it to your training script and data. SageMaker will then temporarily create an AWS instance to train the model.
Once trained that instance gets automatically destroyed.
Finally, you use the SageMaker SDK to deploy the trained model to another AWS instance. This also automatically creates an endpoint that can be called to make predictions.
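A condensed sketch of that flow with the SageMaker Python SDK (v2 parameter names) might look like the following; the role ARN, S3 paths, framework version, and instance types are placeholders.

```python
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder role ARN

# train.py is your standalone training script: it reads data from the channel
# directories, fits the model, and serializes it to /opt/ml/model.
estimator = SKLearn(
    entry_point="train.py",
    role=role,
    instance_type="ml.m5.large",
    instance_count=1,
    framework_version="0.23-1",   # placeholder scikit-learn container version
    sagemaker_session=session,
)

# Spins up a temporary training instance, trains, then tears it down.
estimator.fit({"train": "s3://my-bucket/train"})

# Deploys the trained model and creates an endpoint for predictions.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```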
I am trying to deploy a neural network that can take a dataset and predict answers on the same. I am using AWS Lambda (Python) to do this. I understand that Keras cannot be accessed using the inline code editor, so how do I go about this? How do I upload my code to Lambda such that it supports Keras?
The process is a bit lengthy. Here's an overview:
Launching an EC2 instance with the appropriate AMI
Creating a virtualenv on said instance
Installing Theano, Keras and h5py with pip
Staging Python libraries and shared C++ libraries together for deployment
Writing a handler function
Creating an archive of the artifacts from steps 4 and 5
Deploying the archive to AWS Lambda
Then, if you need to improve Theano performance, you can deploy gcc to Lambda as well, so that it's able to make use of a compiler.
These steps, including the last step, are described at length by Amazon's own Abhishek Patnia. You'll save yourself some time if you read through it first, as some of the latter steps amend what he does initially.
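For step 5 (the handler function), a very rough sketch might look like the following; the model path and input format are placeholders, and the keras/numpy packages are assumed to be bundled in the deployment archive.

```python
import json
import numpy as np
from keras.models import load_model  # assumed to be packaged in the archive

model = None  # cached at module level so warm invocations reuse the loaded model

def handler(event, context):
    global model
    if model is None:
        # Placeholder path: the serialized model shipped inside the archive.
        model = load_model("model.h5")

    # Placeholder input format: a flat list of feature values in the event.
    features = np.array(event["features"], dtype="float32").reshape(1, -1)
    prediction = model.predict(features).tolist()

    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```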
Use the Serverless Framework; it allows you to deploy your Lambda code and its dependencies on AWS and other cloud providers without having to deal with environment differences (local vs. AWS Linux).
https://www.serverless.com/framework/docs/providers/aws/examples/hello-world/python/