Difference in usecases for AWS Sagemaker vs Databricks?

Difference in usecases for AWS Sagemaker vs Databricks? - apache-spark

I was looking at Databricks because it integrates with AWS services like Kinesis, but it looks to me like SageMaker is a direct competitor to Databricks? We are heavily using AWS, is there any reason to add DataBricks into the stack or odes SageMaker fill the same role?

SageMaker is a great tool for deployment, it simplifies a lot of processes configuring containers, you only need to write 2-3 lines to deploy the model as an endpoint and use it. SageMaker also provides the dev platform (Jupyter Notebook) which supports Python and Scala (sparkmagic kernal) developing, and i managed installing external scala kernel in jupyter notebook. Overall, SageMaker provides end-to-end ML services. Databricks has unbeatable Notebook environment for Spark development.
Conclusion
Databricks is a better platform for Big data(scala, pyspark) Developing.(unbeatable notebook environment)
SageMaker is better for Deployment. and if you are not working on big data, SageMaker is a perfect choice working with (Jupyter notebook + Sklearn + Mature containers + Super easy deployment).
SageMaker provides "real time inference", very easy to build and deploy, very impressive. you can check the official SageMaker Github.
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/scikit_learn_inference_pipeline

Having worked in both environments within the last year, I specifically remember:
Databricks having easy access to stored databases/tables to query out of and use Scala/Spark within the Jupyter Notebooks. I remember how nice it was to just see and preview the schemas and query quickly and be off to the races for research. I also remember the quick functionality to set up a timed job on a Notebook (re-run every month) and re-scale to job instance types (much cheaper) with some button clicks. These functionalities might exist somewhere in AWS, but I remember it being great in Databricks.
AWS SageMaker + Lambda + API Gateway: Legitimately, today, I worked through the deployment of AWS SageMaker + Lambda + API Gateway, and after getting used to some syntax and specifics of the Lambda + API Gateway it was pretty straightforward. Doing another AWS deployment wouldn't take more than 20 minutes (pending unique specificities). Other things like Model Monitoring and CloudWatch are nice as well. I did notice Jupyter Notebook Kernels for many languages like Python (what I did it in), R, and Scala, along with specific packages already pre-installed like conda and sagemaker ml packages and methods.

Related

Fb-Prophet, Apache Spark in Colab and AWS SageMaker/ Lambda

I am using Google-Colab for creating a model by using FbProphet and i am try to use Apache Spark in the Google-Colab itself. Now can i upload this Google-colab notebook in aws Sagemaker/Lambda for free (without charge for Apache Spark and only charge for AWS SageMaker)?

In short, You can upload the notebook without any issue into SageMaker. Few things to keep in mind
If you are using the pyspark library in colab and running spark locally, you should be able to do the same by installing necessary pyspark libs in Sagemaker studio kernels. Here you will only pay for the underlying compute for the notebook instance. If you are experimenting then I would recommend you to use https://studiolab.sagemaker.aws/ to create a free account and try things out.
If you had a separate spark cluster setup then you may need a similar setup in AWS using EMR so that you can connect to the cluster to execute the job.

Embarrassingly parallel hyperparameter search via Azure + DataBricks + MLFlow

Conceptual question. My company is pushing Azure + DataBricks. I am trying to understand where this can take us.
I am porting some work I've done locally to the Azure + Databricks platform. I want to run an experiment with a large number of hyperparameter combinations using Azure + Databricks + MLfLow. I am using PyTorch to implement my models.
I have a cluster with 8 nodes. I want to kick off the parameter search across all of the nodes in an embarrassingly parallel manner (one run per node, running independently). Is this as simple as creating a MLflow project and then using the mlflow.projects.run command for each hyperparameter combination and Databricks + MLflow will take care of the rest?
Is this technology capable of this? I'm looking for some references I could use to make this happen.

The short answer is yes, it's possible, but won't be exactly as easy as running a single mlflow command. You can paralelize single-node workflows using spark Python UDFs, a good example of this is this notebook
I'm not sure if this will work with pytorch, but there is hyperopt library that lets you parallelize search across parameters using Spark - it's integrated with mlflow and available in databricks ML runtime. I've been using it only with scikit-learn, but it may be worth checking out

Apache Zeppelin Notebook Deployment

I'm new to ETL development with PySpark and I've been writing my scripts as paragraphs on Apache Zeppelin Notebooks. I was curious what the typical flow was for a deployment process? How are you converting your code from a Zeppelin Notebook to your ETL pipeline?
Thanks!

Well that heavily depends on the sort of ETL that you're doing.
If you want to keep the scripts in the notebooks and you just need to orchestrate their execution then you have a couple options:
Use Zeppelin's built-in scheduler
Use cron to launch your notebooks via curl commands and Zeppelin's REST API
But if you already have an up-and-running workflow management tool like Apache Airflow then you can add new tasks that launch the aforementioned curl commands to trigger the notebooks (with Airflow, you can use BashOperator or PythonOperator), but keep in mind that you'll need some workarounds to have a sequential execution of different notes.
One major tech company that's betting heavily on notebooks is Netflix (you can take a look at this), and they developed a set of tools to improve the effeciency of notebook-based ETL pipelines, like Commuter and Papermill. They're more into Jupyter, so Zeppelin compatibility is still not provided, but the core concepts should be the same when working with Zeppelin.
For more on Netflix' notebook-based pipelines, you can refer to this article shared on their tech blog.

Tensorflow Model Deployment in GCP without Tensorflow Serving

Machine Learning Model: Tensorflow Based (version 1.9) & Python version 3.6
Data Input: From Bigquery
Data Output: To Bigquery
Production prediction frequency: Monthly
I have a developed a Tensorflow based machine learning model. I have trained it locally and want to deploy it in Google Cloud Platform for predictions.
The model reads input data from Google Bigquery and the output predictions has to be written in Google Bigquery. There are some data preparation scripts which has to be run before the model prediction is run. Currently I cannot use BigQuery ML in Production as it is in Beta stage. Additionally as it is a batch prediction I don't think Tensorflow Serving will be a good choice.
Strategies which I have tried for deployment:
Use Google ML Engine for prediction: This approach creates output part files on GCS. These have to be combined and written to Google Bigquery. So in this approach I have to spin up a VM just to execute the data preparation script and ML Engine output to Google Bigquery script. This adds up to 24x7 cost of VM just for running two scripts in a month.
Use Dataflow for data preparation script execution along with Google ML Engine: Dataflow uses python 2.7 while the model is developed in Tensorflow version 1.9 and python version 3.6. So this approach cannot be used.
Google App Engine: Using this approach a complete web application has to be developed in order to serve predictions. As the predictions are in batch this approach is not suitable. Additionally flask/django has to be integrated with the code in order to use it.
Google Compute Engine: Using this approach the VM would be running 24x7 just for running monthly predictions and running two scripts. The would cause a lot of cost overhead.
I would like to know what is best deployment approach for Tensorflow models which has some pre and post processing scripts.

Regarding the option 3, Dataflow can read from BigQuery and store the prepared data in BigQuery at the end of the job.
Then you can have Tensorflow use BigQueryReader to data from BigQuery.
Another that you can use is Datalab, this is a notebook in which you can prepare your data and then use it for your prediction.

I've also not found this process flow easy or intuitive. There are two new updates which might help in your project:
BigQuery ML now allows you to import TensorFlow models link - there are some limitations but this may eliminate some of the back and forth data movement between BQ and cloud storage or other environments.
Cloud DataFlow supports Python 3 in alpha (check the Apache Beam roadmap - link )

Are there any runners supported for apache beam python besides google cloud dataflow?

I have been building python pipelines using google cloud dataflow and apache beam for about a year. I am leaving the google cloud environment for a university cluster, which has spark installed. It looks like the spark runner is only for java (https://beam.apache.org/documentation/runners/spark/)? Are there any suggestions on how to run python apache beam pipelines outside of cloud dataflow?

As of right now, this is not yet possible, but portability across runners and languages is the highest priority and the most active area of development in Beam right now, and I think the portable Flink runner is very close to being able to run simple pipelines in Python, with portable Spark runner development to commence soon (and share lots of code with Flink). Stay tuned and follow the dev# mailing list!

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string