FbProphet, Apache Spark in Colab, and AWS SageMaker/Lambda

I am using Google Colab to create a model with FbProphet, and I am trying to use Apache Spark in the same Colab notebook. Can I upload this Colab notebook to AWS SageMaker/Lambda for free (i.e., without a separate charge for Apache Spark, paying only for AWS SageMaker)?

In short, you can upload the notebook into SageMaker without any issue. A few things to keep in mind:
If you are using the pyspark library in Colab and running Spark locally, you should be able to do the same by installing the necessary pyspark libraries in a SageMaker Studio kernel (a minimal sketch follows below). Here you will only pay for the underlying compute of the notebook instance. If you are just experimenting, I would recommend using https://studiolab.sagemaker.aws/ to create a free account and try things out.
If you had a separate Spark cluster set up, then you may need a similar setup in AWS using EMR so that you can connect to the cluster to execute the job.
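For the local-Spark case, here is a minimal, hedged sketch of what a SageMaker Studio (or Studio Lab) notebook cell could look like; the install step and the app name are assumptions for illustration, not SageMaker-specific requirements:
# First install Spark into the notebook kernel, e.g.: %pip install pyspark
from pyspark.sql import SparkSession

# local[*] runs Spark inside the notebook instance itself, using all available
# cores, so the only cost is the notebook's underlying compute.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("prophet-experiments")   # placeholder app name
         .getOrCreate())

df = spark.createDataFrame([("2024-01-01", 1.0)], ["ds", "y"])
df.show()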

Related

Difference in use cases for AWS SageMaker vs Databricks?

I was looking at Databricks because it integrates with AWS services like Kinesis, but it looks to me like SageMaker is a direct competitor to Databricks. We are heavily using AWS; is there any reason to add Databricks to the stack, or does SageMaker fill the same role?
SageMaker is a great tool for deployment; it simplifies a lot of the container-configuration process, and you only need to write 2-3 lines to deploy a model as an endpoint and use it. SageMaker also provides the dev platform (Jupyter Notebook), which supports Python and Scala (sparkmagic kernel) development, and I managed to install an external Scala kernel in the Jupyter notebook. Overall, SageMaker provides end-to-end ML services. Databricks has an unbeatable notebook environment for Spark development.
Conclusion
Databricks is the better platform for big data (Scala, PySpark) development: an unbeatable notebook environment.
SageMaker is better for deployment. And if you are not working on big data, SageMaker is a perfect choice (Jupyter notebook + sklearn + mature containers + super easy deployment).
SageMaker provides "real time inference", which is very easy to build and deploy, and very impressive. You can check the official SageMaker GitHub:
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/scikit_learn_inference_pipeline
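As a hedged illustration of the "2-3 lines to deploy" claim above, here is a minimal sketch using the SageMaker Python SDK's scikit-learn estimator; the entry-point script, S3 path, and instance types are placeholders, not values taken from the linked example:
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

# Train a scikit-learn model with a hypothetical training script.
estimator = SKLearn(
    entry_point="train.py",                  # placeholder training script
    role=sagemaker.get_execution_role(),
    instance_type="ml.m5.large",             # placeholder instance type
    framework_version="1.2-1",
)
estimator.fit({"train": "s3://my-bucket/train"})  # placeholder S3 path

# Deployment itself is essentially one call:
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
print(predictor.endpoint_name)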
Having worked in both environments within the last year, I specifically remember:
Databricks having easy access to stored databases/tables to query from, and to Scala/Spark within its notebooks. I remember how nice it was to just see and preview the schemas, query quickly, and be off to the races for research. I also remember the quick functionality to set up a timed job on a notebook (re-run every month) and rescale to cheaper job instance types with a few button clicks. These functionalities might exist somewhere in AWS, but I remember them being great in Databricks.
AWS SageMaker + Lambda + API Gateway: Legitimately, just today I worked through the deployment of AWS SageMaker + Lambda + API Gateway, and after getting used to some syntax and specifics of Lambda + API Gateway, it was pretty straightforward. Doing another AWS deployment wouldn't take more than 20 minutes (barring unique specifics); a hedged sketch of the Lambda piece follows below. Other things like Model Monitoring and CloudWatch are nice as well. I did notice Jupyter notebook kernels for many languages like Python (what I used), R, and Scala, along with packages already pre-installed, like conda and the sagemaker ML packages and methods.
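Here is a minimal sketch of the Lambda piece of that pipeline: a handler that forwards an API Gateway request body to a SageMaker endpoint via boto3. The endpoint-name environment variable and the CSV payload format are assumptions for illustration:
import json
import os
import boto3

runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    # With API Gateway proxy integration, the request body arrives as a string.
    payload = event["body"]
    response = runtime.invoke_endpoint(
        EndpointName=os.environ["ENDPOINT_NAME"],  # placeholder env var
        ContentType="text/csv",                    # assumed payload format
        Body=payload,
    )
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}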

Azure HDInsight Spark Cluster Install External Libraries

I have an HDInsight Spark cluster. I installed TensorFlow using a script action. The installation went fine (Success).
But now when I go and create a Jupyter notebook, I get:
import tensorflow
Starting Spark application
The code failed because of a fatal error:
Session 8 unexpectedly reached final status 'dead'. See logs:
YARN Diagnostics:
Application killed by user..
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context. For instructions on how to assign resources see http://go.microsoft.com/fwlink/?LinkId=717038
b) Contact your cluster administrator to make sure the Spark magics library is configured correctly.
I don't know how to fix this error. I tried some things, like looking at the logs, but they were not helpful.
I just want to connect to my data and train a model using TensorFlow.
This looks like an error with Spark application resources. Check the resources available on your cluster and close any applications that you don't need; a hedged sketch of one way to do that follows. Please see more details here: https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-resource-manager#kill-running-applications
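Since the linked page boils down to freeing YARN resources, here is a hedged, generic sketch of listing and killing YARN applications through the ResourceManager REST API; the host, port, and application id are placeholders, and HDInsight may front this API differently (e.g. behind Ambari):
import requests

rm = "http://<resourcemanager-host>:8088"  # placeholder ResourceManager address

# List running applications to see what is holding resources.
apps = requests.get(f"{rm}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["allocatedMB"], "MB")

# Kill an application you no longer need (placeholder application id).
requests.put(
    f"{rm}/ws/v1/cluster/apps/application_1234567890_0001/state",
    json={"state": "KILLED"},
)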

How to connect to Spark on EMR from a locally running Spark shell

I have created a Spark EMR cluster. I would like to execute jobs either on my localhost or on the EMR cluster.
Assuming I run spark-shell on my local computer, how can I tell it to connect to the Spark EMR cluster? What would be the exact configuration options and/or commands to run?
It looks like others have also failed at this and ended up running the Spark driver on EMR, but then making use of, e.g., Zeppelin or Jupyter running on EMR.
Setting up our own machines as Spark drivers that connected to the core nodes on EMR would have been ideal. Unfortunately, this was impossible to do and we gave up after trying many configuration changes. The driver would start up and then keep waiting unsuccessfully, trying to connect to the slaves.
Most of our Spark development is on pyspark using Jupyter Notebook as our IDE. Since we had to run Jupyter from the master node, we couldn’t risk losing our work if the cluster were to go down. So, we created an EBS volume and attached it to the master node and placed all of our work on this volume. [...]
source
Note: if you go down this route, I would consider using S3 for storing notebooks; then you don't have to manage EBS volumes. A hedged sketch follows.
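Riffing on that note, here is a minimal sketch of pushing notebooks from the master node to S3 with boto3; the bucket name and directory layout are placeholders:
import pathlib
import boto3

s3 = boto3.client("s3")
notebook_dir = pathlib.Path.home() / "notebooks"  # placeholder directory

# Upload every notebook, preserving the relative directory layout as the key.
for path in notebook_dir.rglob("*.ipynb"):
    key = f"notebooks/{path.relative_to(notebook_dir)}"
    s3.upload_file(str(path), "my-notebook-bucket", key)  # placeholder bucket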
One way of doing this is to add your Spark job as an EMR step on your EMR cluster. For this, you need the AWS CLI installed on your local computer
(see here for the installation guide) and your jar file on S3.
Once you have the AWS CLI, assuming the Spark class to run is com.company.my.MySparkJob and your jar file is located on S3 at s3://hadi/my-project-0.1.jar, you can run the following command from your terminal:
aws emr add-steps --cluster-id j-************* --steps Type=Spark,Name=My_Spark_Job,Args=[--class,com.company.my.MySparkJob,s3://hadi/my-project-0.1.jar],ActionOnFailure=CONTINUE
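If you would rather do this from Python than the CLI, here is a hedged boto3 equivalent, using the same class and jar path and a placeholder cluster id:
import boto3

emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[{
        "Name": "My_Spark_Job",
        "ActionOnFailure": "CONTINUE",
        # command-runner.jar lets an EMR step invoke spark-submit directly.
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--class", "com.company.my.MySparkJob",
                "s3://hadi/my-project-0.1.jar",
            ],
        },
    }],
)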

Can Spark access DynamoDb without EMR

I have a set of AWS instances with an Apache Hadoop distribution and Apache Spark set up.
I am trying to access DynamoDB through Spark Streaming for reading from and writing to a table.
While writing the Spark-DynamoDB code, I learned that emr-ddb-hadoop.jar is required for the DynamoDB InputFormat and OutputFormat, and that this jar is present only on an EMR cluster.
After checking a few blogs, it seems it is accessible only with Spark on EMR.
Is that correct?
However, when I use the standalone Java SDK to access DynamoDB, it works fine.
I found the solution to the problem.
I downloaded the emr-ddb-hadoop.jar file from an EMR cluster and am using it in my own environment.
Please note: to access DynamoDB from Spark, only the jar above is needed. A hedged usage sketch follows.
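For reference, here is a minimal sketch of reading a DynamoDB table from plain Spark using that jar; it assumes emr-ddb-hadoop.jar is on the classpath (e.g. passed via spark-submit --jars), and the table name and region are placeholders:
from pyspark import SparkContext

sc = SparkContext(appName="dynamodb-read")

conf = {
    "dynamodb.servicename": "dynamodb",
    "dynamodb.input.tableName": "MyTable",  # placeholder table name
    "dynamodb.regionid": "us-east-1",       # placeholder region
}

# The InputFormat and item classes below are provided by emr-ddb-hadoop.jar.
rows = sc.hadoopRDD(
    inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
    conf=conf,
)
print(rows.count())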

Connecting IPython notebook to spark master running in different machines

I don't know if this has already been answered on SO, but I couldn't find a solution to my problem.
I have an IPython notebook running in a Docker container on Google Container Engine; the container is based on the jupyter/all-spark-notebook image.
I also have a Spark cluster created with Google Cloud Dataproc.
The Spark master and the notebook are running in different VMs, but in the same region and zone.
My problem is that I'm trying to connect to the Spark master from the IPython notebook, but without success. I use this snippet of code in my Python notebook:
import pyspark
conf = pyspark.SparkConf()
conf.setMaster("spark://<spark-master-ip or spark-master-hostname>:7077")
I just started working with Spark, so I'm sure I'm missing something (authentication, security, ...).
What I found out there is about connecting a local browser over an SSH tunnel.
Has somebody already done this kind of setup?
Thank you in advance.
Dataproc runs Spark on YARN, so you need to set the master to 'yarn-client'. You also need to point Spark at your YARN ResourceManager, which requires an under-documented SparkConf -> Hadoop Configuration conversion. You also have to tell Spark about HDFS on the cluster, so it can stage resources for YARN. You could use Google Cloud Storage instead of HDFS if you baked the Google Cloud Storage Connector for Hadoop into your image.
Try:
import pyspark
conf = pyspark.SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('My Jupyter Notebook')
# 'spark.hadoop.foo.bar' sets key 'foo.bar' in the Hadoop Configuration.
conf.set('spark.hadoop.yarn.resourcemanager.address', '<spark-master-hostname>')
conf.set('spark.hadoop.fs.default.name', 'hdfs://<spark-master-hostname>/')
sc = pyspark.SparkContext(conf=conf)
For a more permanent config, you could bake these into a local file 'core-site.xml' as described here, place that in a local directory, and set HADOOP_CONF_DIR to that directory in your environment.
It's also worth noting that while being in the same zone is important for performance, it is being in the same network, with TCP allowed between internal IP addresses on that network, that lets your VMs communicate. If you are using the default network, then the default-allow-internal firewall rule should be sufficient.
Hope that helps.
