MS Azure Machine Learning - Notebook extremely slow in comparison to GC - azure

I work with a jupyter notebook on different platforms.
The notebook includes a detectron2 algorithm.
The runtime for the training process differs significantly between platforms (same code, same parameters, same data):
Google Colab
K80 GPU (Successfully used while training)
Data stored on Google Drive
Torch: 1.9.0+cu111
Runtime: 4.5h
Microsoft Azure - Machine Learning
K80 GPU (Successfully used while training)
STANDARD_NC6
Data stored on Azure Files
Torch: 1.9.0+cu111
Runtime: 15h
Why is the expensive Azure ML setup three times slower than free Google Colab, and is there a way to improve it?
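Since the data source is the main stated difference between the two setups (Google Drive vs. an Azure Files share), one way to narrow this down is to time data loading in isolation on both platforms. The sketch below is purely illustrative and not part of the original question; it assumes the same detectron2 cfg used for training.

import time
from detectron2.data import build_detection_train_loader

def time_data_loading(cfg, num_batches=200):
    # Illustrative diagnostic: measure how long a fixed number of batches takes
    # to come off the storage backend, independent of GPU compute.
    loader = build_detection_train_loader(cfg)  # same cfg as the training run
    start = time.time()
    for i, _batch in enumerate(loader):
        if i + 1 >= num_batches:
            break
    elapsed = time.time() - start
    print(f"{num_batches} batches in {elapsed:.1f}s "
          f"({elapsed / num_batches:.2f}s per batch)")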

Related

Fb-Prophet, Apache Spark in Colab and AWS SageMaker/Lambda

I am using Google Colab to create a model with FbProphet, and I am trying to use Apache Spark within Google Colab itself. Can I upload this Google Colab notebook to AWS SageMaker/Lambda for free (without being charged for Apache Spark, only for AWS SageMaker)?
In short, you can upload the notebook into SageMaker without any issue. A few things to keep in mind:
If you are using the pyspark library in Colab and running Spark locally, you should be able to do the same by installing the necessary pyspark libraries in SageMaker Studio kernels (see the sketch after this list). In that case you only pay for the underlying compute of the notebook instance. If you are just experimenting, I would recommend creating a free account at https://studiolab.sagemaker.aws/ and trying things out.
If you have a separate Spark cluster set up, you may need a similar setup in AWS using EMR so that you can connect to the cluster and execute the job.
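As a rough illustration of the first point, the snippet below runs Spark locally inside a notebook kernel the same way it would in Colab; the package names (pyspark, prophet) and the toy time series are assumptions, and the Prophet fit happens on the driver, so you only pay for the notebook instance's compute.

# Install into the kernel first, e.g.: pip install pyspark prophet pandas
import pandas as pd
from pyspark.sql import SparkSession
from prophet import Prophet  # the package formerly published as fbprophet

# Local Spark session: all work runs on the notebook instance itself.
spark = SparkSession.builder.master("local[*]").appName("prophet-demo").getOrCreate()

# Toy example: build a time series in Spark, pull it to the driver, fit Prophet.
sdf = spark.createDataFrame(
    pd.DataFrame({"ds": pd.date_range("2023-01-01", periods=90), "y": range(90)})
)
m = Prophet()
m.fit(sdf.toPandas())
forecast = m.predict(m.make_future_dataframe(periods=30))
print(forecast[["ds", "yhat"]].tail())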

Tensorflow gpu seems to not work on Azure Machine Learning GPU Compute

I'm new to Azure Machine Learning, so I hope I did everything OK.
I created a new Jupyter notebook on a new Compute Instance of GPU type.
But when running
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
from the TensorFlow docs, I'm getting 0, and when checking which devices I do have, it's just CPUs:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
[LogicalDevice(name='/device:CPU:0', device_type='CPU')]
Any ideas why is that and what to do here?
It looks like everything is OK with PyTorch; running
import torch
torch.cuda.is_available()
returns True
Packages versions are:
tensorflow 2.4.0
tensorflow-gpu 2.1.0
keras 2.3.1
I believe the problem is with how you installed TensorFlow.
Take a look at some examples here.
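If the mismatch between tensorflow 2.4.0 and tensorflow-gpu 2.1.0 is indeed the culprit, one way to check is to remove both packages, install a single consistent GPU-enabled build (the version below is illustrative), and confirm that the installed TensorFlow was built with CUDA:

# pip uninstall -y tensorflow tensorflow-gpu
# pip install tensorflow-gpu==2.4.0   # illustrative; pick one consistent GPU build
import tensorflow as tf

print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPUs visible:", tf.config.list_physical_devices("GPU"))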
Information on various Azure GPU configurations can be found here: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu
To understand how your job is using the underlying CPU/GPU resources, we have recently released integrated compute performance metrics in the Experimentation UI, which we can whitelist customers for. These metrics have traditionally existed in Azure Monitor (the view you are using below), but can now be used directly by data scientists alongside their run metrics.

Embarrassingly parallel hyperparameter search via Azure + DataBricks + MLFlow

Conceptual question. My company is pushing Azure + DataBricks. I am trying to understand where this can take us.
I am porting some work I've done locally to the Azure + Databricks platform. I want to run an experiment with a large number of hyperparameter combinations using Azure + Databricks + MLflow. I am using PyTorch to implement my models.
I have a cluster with 8 nodes. I want to kick off the parameter search across all of the nodes in an embarrassingly parallel manner (one run per node, running independently). Is this as simple as creating an MLflow project and then calling the mlflow.projects.run command for each hyperparameter combination, with Databricks + MLflow taking care of the rest?
Is this technology capable of this? I'm looking for some references I could use to make this happen.
The short answer is yes, it's possible, but it won't be quite as easy as running a single mlflow command. You can parallelize single-node workflows using Spark Python UDFs; a good example of this is this notebook.
I'm not sure whether this works with PyTorch, but the hyperopt library lets you parallelize the search across parameters using Spark. It's integrated with MLflow and available in the Databricks ML runtime. I've only used it with scikit-learn, but it may be worth checking out.
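For reference, a minimal sketch of the hyperopt + SparkTrials pattern described above; the objective, search space, and parallelism value are placeholders, and on Databricks the hyperopt-Spark integration logs each trial to MLflow automatically.

from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK

def objective(params):
    # Train the PyTorch model with `params` here and return a validation loss;
    # the expression below is a dummy stand-in for that metric.
    loss = (params["lr"] - 0.01) ** 2 + params["batch_size"] * 1e-5
    return {"loss": loss, "status": STATUS_OK}

search_space = {
    "lr": hp.loguniform("lr", -7, 0),
    "batch_size": hp.choice("batch_size", [32, 64, 128]),
}

# parallelism=8 matches the 8-node cluster: one trial per worker at a time.
trials = SparkTrials(parallelism=8)
best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=64, trials=trials)
print(best)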

Difference in usecases for AWS Sagemaker vs Databricks?

I was looking at Databricks because it integrates with AWS services like Kinesis, but it looks to me like SageMaker is a direct competitor to Databricks. We are heavily using AWS; is there any reason to add Databricks to the stack, or does SageMaker fill the same role?
SageMaker is a great tool for deployment. It simplifies a lot of the container configuration; you only need to write 2-3 lines to deploy the model as an endpoint and use it. SageMaker also provides the dev platform (Jupyter Notebook), which supports Python and Scala (sparkmagic kernel) development, and I managed to install an external Scala kernel in the Jupyter notebook. Overall, SageMaker provides end-to-end ML services. Databricks has an unbeatable notebook environment for Spark development.
Conclusion
Databricks is a better platform for big data (Scala, PySpark) development (unbeatable notebook environment).
SageMaker is better for deployment, and if you are not working on big data, SageMaker is a perfect choice (Jupyter notebook + sklearn + mature containers + super easy deployment).
SageMaker provides "real-time inference", which is very easy to build and deploy and quite impressive. You can check the official SageMaker examples on GitHub:
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/scikit_learn_inference_pipeline
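As a rough illustration of the "2-3 lines to deploy" point, the sketch below uses the SageMaker Python SDK with a pre-trained scikit-learn artifact; the S3 path, IAM role, entry point, and framework version are all placeholders.

from sagemaker.sklearn import SKLearnModel

model = SKLearnModel(
    model_data="s3://my-bucket/model.tar.gz",        # hypothetical artifact location
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    entry_point="inference.py",                      # your inference handler script
    framework_version="1.2-1",
)

# Creates a real-time HTTPS endpoint backed by the chosen instance type.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")
print(predictor.predict([[5.1, 3.5, 1.4, 0.2]]))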
Having worked in both environments within the last year, I specifically remember:
Databricks having easy access to stored databases/tables to query from and the ability to use Scala/Spark within its notebooks. I remember how nice it was to just see and preview the schemas, query quickly, and be off to the races for research. I also remember the quick functionality for setting up a timed job on a notebook (re-run every month) and re-scaling to job instance types (much cheaper) with a few button clicks. These functionalities might exist somewhere in AWS, but I remember them being great in Databricks.
AWS SageMaker + Lambda + API Gateway: just today I worked through a deployment with AWS SageMaker + Lambda + API Gateway, and after getting used to some syntax and specifics of Lambda + API Gateway, it was pretty straightforward. Doing another AWS deployment wouldn't take more than 20 minutes (pending unique specifics). Other things like Model Monitoring and CloudWatch are nice as well. I also noticed Jupyter notebook kernels for many languages like Python (which I used), R, and Scala, along with packages already pre-installed, such as conda and the SageMaker ML packages and methods.
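For the Lambda + API Gateway piece described above, a minimal handler might look like the sketch below; the endpoint name and payload format are hypothetical, and API Gateway is assumed to proxy the request body through to the function.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    # API Gateway (proxy integration) passes the raw request body through.
    payload = event["body"]
    response = runtime.invoke_endpoint(
        EndpointName="my-sklearn-endpoint",   # hypothetical endpoint name
        ContentType="application/json",
        Body=payload,
    )
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}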

Tensorflow Model Deployment in GCP without Tensorflow Serving

Machine Learning Model: TensorFlow-based (version 1.9) & Python version 3.6
Data Input: From Bigquery
Data Output: To Bigquery
Production prediction frequency: Monthly
I have developed a TensorFlow-based machine learning model. I have trained it locally and want to deploy it on Google Cloud Platform for predictions.
The model reads its input data from Google BigQuery, and the output predictions have to be written back to Google BigQuery. There are some data preparation scripts that have to be run before the model prediction runs. Currently I cannot use BigQuery ML in production as it is in beta. Additionally, as this is batch prediction, I don't think TensorFlow Serving would be a good choice.
Strategies which I have tried for deployment:
Use Google ML Engine for prediction: this approach creates output part files on GCS. These have to be combined and written to Google BigQuery. So in this approach I would have to spin up a VM just to execute the data preparation script and the script that writes the ML Engine output to Google BigQuery. This adds up to the 24x7 cost of a VM just to run two scripts a month.
Use Dataflow for the data preparation scripts along with Google ML Engine: Dataflow uses Python 2.7, while the model is developed with TensorFlow 1.9 and Python 3.6, so this approach cannot be used.
Google App Engine: with this approach, a complete web application has to be developed in order to serve predictions. As the predictions are run in batch, this approach is not suitable. Additionally, Flask/Django would have to be integrated with the code in order to use it.
Google Compute Engine: with this approach, the VM would be running 24x7 just for the monthly predictions and the two scripts, which would cause a lot of cost overhead.
I would like to know the best deployment approach for TensorFlow models that have some pre- and post-processing scripts.
Regarding option 3, Dataflow can read from BigQuery and store the prepared data in BigQuery at the end of the job.
Then you can have TensorFlow use BigQueryReader to read data from BigQuery.
Another option is Datalab; this is a notebook environment in which you can prepare your data and then use it for your prediction.
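A minimal sketch of that Dataflow option (read from BigQuery, run the preparation step, write back to BigQuery) is shown below; the project, tables, schema, and transform are placeholders, and the newer ReadFromBigQuery/WriteToBigQuery transforms (available once Python 3 support landed in Beam) are assumed.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def prepare(row):
    # Data-preparation logic goes here; this is a placeholder transform.
    return {"feature": row["raw_value"] * 2}

options = PipelineOptions(runner="DataflowRunner", project="my-project",
                          region="us-central1", temp_location="gs://my-bucket/tmp")

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromBigQuery(
           query="SELECT raw_value FROM `my-project.dataset.raw`",
           use_standard_sql=True)
     | "Prepare" >> beam.Map(prepare)
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:dataset.prepared",
           schema="feature:FLOAT",
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))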
I've also not found this process flow easy or intuitive. There are two new updates which might help in your project:
BigQuery ML now allows you to import TensorFlow models (link) - there are some limitations, but this may eliminate some of the back-and-forth data movement between BQ and Cloud Storage or other environments.
Cloud Dataflow supports Python 3 in alpha (check the Apache Beam roadmap - link).
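As a sketch of the first update, a TensorFlow SavedModel exported to GCS can be imported into BigQuery ML and then used for batch prediction entirely inside BigQuery; the dataset, table, and bucket names below are placeholders, and the statements are issued through the BigQuery Python client.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Import the SavedModel from Cloud Storage into a BigQuery ML model.
client.query("""
CREATE OR REPLACE MODEL `my_dataset.my_tf_model`
OPTIONS (model_type='TENSORFLOW',
         model_path='gs://my-bucket/saved_model/*')
""").result()

# Batch prediction straight from a BigQuery table, no VM required.
rows = client.query("""
SELECT * FROM ML.PREDICT(MODEL `my_dataset.my_tf_model`,
                         (SELECT * FROM `my_dataset.input_table`))
""").result()
for row in rows:
    print(dict(row))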
