We are using Airflow to schedule our jobs on EMR, and we now want to use Apache Livy to submit Spark jobs via Airflow.
I need more guidance on the following:
Which Airflow-Livy operator should we use for Python 3+ PySpark and Scala jobs?
I have seen these two:
https://github.com/rssanders3/airflow-spark-operator-plugin
and
https://github.com/panovvv/airflow-livy-operators
I would like to know which stable Airflow-Livy operator people are using in production, ideally on an AWS stack.
A step-by-step installation guide for the integration would also help.
I would recommend using LivyOperator from https://github.com/apache/airflow/blob/master/airflow/providers/apache/livy/operators/livy.py
Currently it is only available on master, but you can copy-paste the code and use it as a custom operator until we backport all the new operators to the Airflow 1.10.* series.
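As an illustration (not part of the original answer), here is a minimal sketch of how that LivyOperator could be used once the provider code is available, assuming an Airflow HTTP connection named livy_default that points at the Livy server and a placeholder application file:

from datetime import datetime

from airflow import DAG
# Adjust this import if you copied the provider code in as a custom operator.
from airflow.providers.apache.livy.operators.livy import LivyOperator

with DAG(
    dag_id="livy_spark_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Submits a batch via Livy's POST /batches and polls it until completion.
    submit_job = LivyOperator(
        task_id="submit_job",
        livy_conn_id="livy_default",       # Airflow HTTP connection to the Livy server
        file="s3://my-bucket/jobs/pi.py",  # placeholder PySpark application
        args=["100"],
        num_executors=2,
        executor_memory="2g",
        polling_interval=30,               # seconds between state checks
    )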
Related
We have written a program that fetches data from different sources, applies modifications, and writes the modified data into a MySQL database. The program uses Apache Spark for the ETL process via the Spark Java API. We will be deploying the live application on YARN or Kubernetes.
I need to run the program as a scheduled job, say at an interval of five minutes. I did some research and got different suggestions from blogs and articles, including a plain cron job, AWS Glue, and Apache Airflow for scheduling a Spark application. From my reading, it seems I can't run my code (Spark Java API) using AWS Glue, as it supports only Python and Scala.
Can someone provide insights or suggestions on this? Which is the best option for running a Spark application (on Kubernetes or YARN) as a scheduled job?
Is there an option for this in Amazon EMR? Thanks in advance.
The best option in my opinion, and the one I have used before, is a cron job, either:
From inside your container with crontab -e, with logging set up in case of failure, such as:
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/spark/bin
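# Run the submit script every day at 13:40; stdout and stderr are redirected to the cron log file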
40 13 * * * . /PERSIST_DATA/02_SparkPi_Test_Spark_Submit.sh > /PERSIST_DATA/work-dir/cron_logging/cronSpark 2>&1
Or with a Kubernetes CronJob; see here for the different settings:
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
I want to use Airflow for orchestration of jobs that include running some Pig scripts, shell scripts, and Spark jobs.
Mainly for the Spark jobs, I want to use Apache Livy, but I'm not sure whether it is a good idea to use it or to run spark-submit directly.
What is the best way to track a Spark job with Airflow once I have submitted it?
My assumption is you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the other possibilities:
Specifying remote master IP: Requires modifying global configurations / environment variables
Using SSHOperator: SSH connection might break
Using EmrAddStepsOperator: Dependent on EMR
Regarding tracking
Livy only reports state, not progress (% completion of stages)
If you're OK with that, you can just poll the Livy server via its REST API and keep printing logs to the console; those will appear in the task logs in the WebUI (View Logs)
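For illustration only (not from the original answer), a rough sketch of such a polling loop, assuming a Livy batch has already been submitted; the host and port are placeholders:

import time
import requests

LIVY_URL = "http://livy-host:8998"  # placeholder Livy endpoint

def wait_for_batch(batch_id, poll_seconds=30):
    """Poll GET /batches/{id}/state until a terminal state, printing the driver
    log on each pass so it shows up in the Airflow task log (View Logs)."""
    terminal_states = {"success", "dead", "killed"}
    while True:
        state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
        log_lines = requests.get(
            f"{LIVY_URL}/batches/{batch_id}/log", params={"from": 0, "size": 100}
        ).json().get("log", [])
        print("\n".join(log_lines))
        if state in terminal_states:
            return state
        time.sleep(poll_seconds)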
Other considerations
Livy doesn't support reusing the SparkSession across POST /batches requests
If that's imperative, you'll have to write your application code in PySpark and use POST /sessions requests
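A hedged sketch of that session-based flow (the Livy address is a placeholder); the session is created once and then reused across statements:

import time
import requests

LIVY_URL = "http://livy-host:8998"  # placeholder Livy endpoint
HEADERS = {"Content-Type": "application/json"}

# Create an interactive PySpark session (POST /sessions).
session_id = requests.post(
    f"{LIVY_URL}/sessions", json={"kind": "pyspark"}, headers=HEADERS
).json()["id"]

# Wait until the session is idle before sending code to it.
while requests.get(f"{LIVY_URL}/sessions/{session_id}/state").json()["state"] != "idle":
    time.sleep(5)

# Each statement reuses the same SparkSession (POST /sessions/{id}/statements).
requests.post(
    f"{LIVY_URL}/sessions/{session_id}/statements",
    json={"code": "print(spark.range(100).count())"},
    headers=HEADERS,
)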
References
How to submit Spark jobs to EMR cluster from Airflow?
livy/examples/pi_app
rssanders3/livy_spark_operator_python_example
Useful links
Remote spark-submit to YARN running on EMR
An article I read previously said that newer Kubernetes versions already include Spark capabilities, but with a different approach, such as using KubernetesPodOperator instead of BashOperator / PythonOperator to run spark-submit.
Is the best practice for combining Airflow + Kubernetes to remove Spark and use KubernetesPodOperator to execute the task?
Would that give better performance, since Kubernetes has autoscaling that Spark doesn't have?
I need someone experienced with Kubernetes to help me understand this. I'm still a newbie with Kubernetes, Spark, and Airflow.
Thank you.
newer Kubernetes versions already include Spark capabilities
I think you got that backwards. New versions of Spark can run tasks in a Kubernetes cluster.
using KubernetesPodOperator instead of BashOperator / PythonOperator to run spark-submit
Using Kubernetes would allow you to run containers with whatever isolated dependencies you wanted.
Meaning
With BashOperator, you must distribute the files to some shared filesystem or to all the nodes that run the Airflow tasks. For example, spark-submit must be available on all Airflow nodes.
Similarly with PythonOperator, you have to ship out zip or egg files that include your pip/conda dependency environment.
remove Spark and use KubernetesPodOperator to execute the task
There are still good reasons to run Spark with Airflow, but instead you would package a Spark driver container that executes spark-submit from inside a container against the Kubernetes cluster. This way, you only need Docker installed, not Spark (and all its dependencies).
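As a rough illustration only, a task along these lines could run spark-submit from inside such a driver container; the image name, class, and paths are placeholders, and the import path is the Airflow 1.10 contrib location:

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# Runs spark-submit inside a container, so Airflow workers need no Spark install.
submit_spark_job = KubernetesPodOperator(
    task_id="submit_spark_job",
    namespace="airflow",
    name="spark-driver",
    image="my-registry/spark-driver:latest",  # placeholder image with spark-submit + JAR
    cmds=["spark-submit"],
    arguments=[
        "--master", "k8s://https://kubernetes.default.svc",
        "--deploy-mode", "cluster",
        "--class", "com.example.MyJob",       # placeholder main class
        "local:///opt/app/my-job.jar",        # JAR baked into the image
    ],
    get_logs=True,
    is_delete_operator_pod=True,
    dag=dag,  # assumes an existing DAG object
)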
Kubernetes has autoscaling that Spark doesn't have
Spark does have Dynamic Resource Allocation...
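For reference, dynamic allocation is a standard Spark configuration; a minimal sketch with illustrative values only:

from pyspark.sql import SparkSession

# Executors are requested and released based on workload instead of being fixed.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.shuffle.service.enabled", "true")  # required on YARN
    .getOrCreate()
)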
One more solution which may help you is to use Apache Livy on Kubernetes (PR: https://github.com/apache/incubator-livy/pull/167) with Airflow HttpOperator.
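A sketch of what that HttpOperator submission might look like, assuming an Airflow HTTP connection named livy_http pointing at the Livy service (connection name, file path, and class are placeholders):

import json

from airflow.operators.http_operator import SimpleHttpOperator

# POSTs a batch submission to Livy's REST API (POST /batches).
submit_batch = SimpleHttpOperator(
    task_id="submit_livy_batch",
    http_conn_id="livy_http",                   # placeholder Airflow connection
    endpoint="batches",
    method="POST",
    data=json.dumps({
        "file": "local:///opt/app/my-job.jar",  # placeholder application file
        "className": "com.example.MyJob",       # placeholder main class
        "numExecutors": 2,
    }),
    headers={"Content-Type": "application/json"},
    dag=dag,  # assumes an existing DAG object
)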
I am new to Spark. I have developed a PySpark script through the Jupyter notebook interactive UI installed in our HDInsight cluster. As of now I run the code from Jupyter itself, but now I have to automate the script. I tried to use Azure Data Factory but could not find a way to run the PySpark script from there. I also tried to use Oozie but could not figure out how to use it. I have tried saving the notebook, reopening it, and running all cells, but that is a manual process.
Please help me schedule a PySpark job in Microsoft Azure.
I searched for a discussion about the best practice for running scheduled jobs, like crontab, with Apache Spark for PySpark, which you might have already reviewed.
If you don't want Oozie, a simple idea is to save the Jupyter notebook locally as a Python script and write a small script that submits it to HDInsight Spark via Livy, with Linux crontab as the scheduler (a minimal example follows the links below). For reference, see the following:
IPython Notebook save location
How can I configure pyspark on livy to use anaconda python instead of the default one
Submit Spark jobs remotely to an Apache Spark cluster on HDInsight using Livy
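As an illustration only, a minimal Python equivalent of that submission script, which could then be called from a crontab entry; the cluster name, credentials, and storage path are placeholders:

import requests
from requests.auth import HTTPBasicAuth

# Placeholder HDInsight cluster endpoint and login.
LIVY_URL = "https://my-cluster.azurehdinsight.net/livy/batches"
AUTH = HTTPBasicAuth("admin", "cluster-password")

# Submit the notebook code (exported as a .py file in cluster storage) as a Livy batch.
payload = {"file": "wasbs:///scripts/my_pyspark_job.py"}  # placeholder storage path
resp = requests.post(
    LIVY_URL,
    json=payload,
    headers={"Content-Type": "application/json", "X-Requested-By": "admin"},
    auth=AUTH,
)
print(resp.status_code, resp.json())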
Hope it helps.
I have several Spark jobs on an EMR cluster using YARN that must run on a regular basis and are submitted from Jenkins. Currently the Jenkins machine SSHes into the master node on EMR, where a copy of the code sits in a folder ready to be executed. I would like to be able to clone my repo into the Jenkins workspace and submit the code from Jenkins to be executed on the cluster. Is there a simple way to do this? What is the best way to deploy Spark jobs from Jenkins?
You can use the REST API to make HTTP requests from Jenkins to start/stop the jobs.
If you have Python available in Jenkins, implementing a script with Boto3 is a good, easy, flexible, and powerful option.
You can manage EMR (and therefore Spark) by creating a full cluster or by adding steps to an existing one.
Also, using the same library, you can manage all AWS services.
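A rough sketch of what such a Boto3 call could look like, assuming an existing cluster id and a JAR already uploaded to S3 (both placeholders):

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

# Add a spark-submit step to an already-running EMR cluster.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[
        {
            "Name": "my-spark-job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "--class", "com.example.MyJob",    # placeholder main class
                    "s3://my-bucket/jars/my-job.jar",  # placeholder artifact
                ],
            },
        }
    ],
)
print(response["StepIds"])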