Airflow - How to run a KubernetesPodOperator with a non exiting command - apache-spark

I'm trying to set up a DAG that creates a Spark cluster in the first task, submits Spark applications to the cluster in interim tasks, and finally tears down the Spark cluster in the last task.
The approach I'm attempting right now is to use KubernetesPodOperators to create Spark Master and Worker pods. The issue is that they run a Spark daemon that never exits. Because the command called on the pod never exits, those tasks get stuck in Airflow in the running state. So I'm wondering if there's a way to run the Spark daemon and then continue on to the next tasks in the DAG?

The approach I'm attempting right now is to use KubernetesPodOperators to create Spark Master and Worker pods.
Apache Spark provides working support for executing jobs in a Kubernetes cluster. It delivers a driver that is capable of starting executors in pods to run jobs.
You don't need to create Master and Worker pods directly in Airflow.
Rather, build a Docker image containing Apache Spark with the Kubernetes backend.
An example Dockerfile is provided in the project.
Then submit the given jobs to the cluster in a container based off this image by using KubernetesPodOperator. The following sample job is adapted from documentation provided in Apache Spark to submit spark jobs directly to a Kubernetes cluster.
# Import path varies by Airflow version; in 1.10 the operator lives under
# airflow.contrib.operators.kubernetes_pod_operator.
from airflow.operators.kubernetes_pod_operator import KubernetesPodOperator

kubernetes_full_pod = KubernetesPodOperator(
    task_id='spark-job-task-ex',
    name='spark-job-task',
    namespace='default',
    image='<prebuilt-spark-image-name>',
    cmds=['bin/spark-submit'],
    # Each flag and its value is a separate list element: the list is passed
    # to the container as argv, with no shell word-splitting.
    arguments=[
        '--master', 'k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>',
        '--deploy-mode', 'cluster',
        '--name', 'spark-pi',
        '--class', 'org.apache.spark.examples.SparkPi',
        '--conf', 'spark.executor.instances=5',
        '--conf', 'spark.kubernetes.container.image=<prebuilt-spark-image-name>',
        'local:///path/to/examples.jar',
    ],
    #...
)
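When several jobs share most of their settings, the argument list for spark-submit can be generated rather than written by hand. The following is a minimal sketch, not part of Airflow or Spark: the helper name build_spark_submit_args and its parameters are hypothetical. It keeps every flag and every value as its own list element, since the list reaches the container as argv entries without shell word-splitting.

```python
from typing import Dict, List


def build_spark_submit_args(
    master: str,
    name: str,
    main_class: str,
    app_resource: str,
    conf: Dict[str, str],
    deploy_mode: str = "cluster",
) -> List[str]:
    """Return spark-submit arguments, one argv token per list element.

    Hypothetical helper for illustration only.
    """
    args = [
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--name", name,
        "--class", main_class,
    ]
    # Each Spark property becomes its own "--conf key=value" pair.
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    # The application JAR (or script) comes last.
    args.append(app_resource)
    return args


spark_pi_args = build_spark_submit_args(
    master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",
    name="spark-pi",
    main_class="org.apache.spark.examples.SparkPi",
    app_resource="local:///path/to/examples.jar",
    conf={
        "spark.executor.instances": "5",
        "spark.kubernetes.container.image": "<prebuilt-spark-image-name>",
    },
)
```

The resulting list could then be passed as the arguments parameter of the KubernetesPodOperator.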

Related

How to run multiple spark jobs on k8s cluster with simple scheduler

My main intent is to get Spark (3.3) running on k8s with HDFS.
I went through the Spark website and got the SparkPi program running on a k8s cluster via the spark-submit command. I read that if we submit multiple jobs to a k8s cluster, k8s may end up starving all pods, meaning there is no queueing in place and no scheduler (like YARN) that keeps a check on resources and arranges the tasks across the nodes.
So, my question is: what is the simplest way to write a scheduler in k8s? I read about Volcano, but it's not yet in GA. I read about Gang and YuniKorn, but I don't see much community support.

Monitor Spark with Prometheus when Spark clusters are spun up just when needed

We run Spark on Kubernetes, and we spin up a Spark driver and executors for a lot of our tasks (not a spark task). After the task is finished, we spin the cluster (on Kubernetes) down and spin up another one when needed (there could be a lot running simultaneously).
The problem I have is that I can't monitor it with Prometheus because I do not have a driver that is always "alive" from which I can pull information about the executors.
Is there a solution for that kind of architecture?

What's the most elegant/right way to stop a spark job running on a Kubernetes cluster?

I'm new to Apache Spark and I'm trying to run a Spark job using spark-submit on my Kubernetes cluster. I was wondering if there's a right way to stop Spark jobs once the driver and executor pods are spawned? Would deleting the pods themselves be enough?
Thanks!
If you delete an executor pod, it will be recreated and the Spark application will keep working. However, if you delete the driver pod, the application will stop.
So killing the driver pod is actually the way to stop a Spark application during execution.
As you are new to Spark and you want to run it on Kubernetes, you should check this tutorial.
At present, the only way to stop a Spark job running on Kubernetes is to delete the driver pod (unless you have an app controlling the Spark context that is able to manipulate it). Since all other job-related resources are linked to the Spark driver pod through so-called ownerReferences, they will be removed automatically by Kubernetes.
It should clean things up automatically when the job completes.
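As a rough illustration of the answers above, stopping an application comes down to deleting its driver pod, for example by calling kubectl from a script. This is only a sketch: it assumes kubectl is installed and configured for the cluster, and the function names and the example pod name are hypothetical.

```python
import subprocess
from typing import List


def driver_delete_command(driver_pod: str, namespace: str = "default") -> List[str]:
    """Build the kubectl command that deletes the given driver pod."""
    return ["kubectl", "delete", "pod", driver_pod, "--namespace", namespace]


def stop_spark_application(driver_pod: str, namespace: str = "default") -> None:
    """Stop a Spark application by deleting its driver pod.

    Raises subprocess.CalledProcessError if kubectl exits with an error.
    """
    subprocess.run(driver_delete_command(driver_pod, namespace), check=True)


# Example call (hypothetical pod name; requires access to a cluster):
# stop_spark_application("spark-pi-driver", namespace="default")
```

Executor pods do not need to be deleted separately, since, as noted above, they are tied to the driver pod through ownerReferences.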

Spark job submission using Airflow by submitting batch POST method on Livy and tracking job

I want to use Airflow to orchestrate jobs that include running some Pig scripts, shell scripts and Spark jobs.
For the Spark jobs, I want to use Apache Livy, but I'm not sure whether it is a good idea to use it or to run spark-submit instead.
What is the best way to track a Spark job from Airflow once I have submitted it?
My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against other possibilities:
Specifying remote master IP: Requires modifying global configurations / environment variables
Using SSHOperator: SSH connection might break
Using EmrAddStepsOperator: Dependent on EMR
Regarding tracking
Livy only reports state and not progress (% completion of stages)
If you're OK with that, you can just poll the Livy server via its REST API and keep printing logs to the console; those will appear in the task logs in the web UI (View Logs)
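The polling approach could look roughly like the sketch below. It relies on Livy's GET /batches/{batchId}/state endpoint, which returns a JSON object with a "state" field. The function names, the poll interval and the set of states treated as terminal are assumptions made for this illustration, and the fetch_state and sleep parameters exist only so the loop can be exercised without a Livy server.

```python
import json
import time
import urllib.request
from typing import Callable

# Batch states after which polling stops (assumed terminal for this sketch).
TERMINAL_STATES = {"success", "dead", "killed", "error"}


def fetch_batch_state(livy_url: str, batch_id: int) -> str:
    """Ask the Livy server for the current state of a batch."""
    with urllib.request.urlopen(f"{livy_url}/batches/{batch_id}/state") as resp:
        return json.load(resp)["state"]


def wait_for_batch(
    livy_url: str,
    batch_id: int,
    poll_seconds: float = 10.0,
    fetch_state: Callable[[str, int], str] = fetch_batch_state,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the batch reaches a terminal state and return that state."""
    while True:
        state = fetch_state(livy_url, batch_id)
        # Printed lines end up in the Airflow task log.
        print(f"Livy batch {batch_id}: {state}")
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
```

Inside an Airflow task, one would typically raise an exception when the returned state is anything other than "success", so that the task is marked as failed.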
Other considerations
Livy doesn't support reusing a SparkSession for POST /batches requests
If that's imperative, you'll have to write your application code in PySpark and use POST /sessions requests
References
How to submit Spark jobs to EMR cluster from Airflow?
livy/examples/pi_app
rssanders3/livy_spark_operator_python_example
Useful links
Remote spark-submit to YARN running on EMR

REST API for Spark 2.3 submit on Kubernetes (version 1.8.*) cluster

I'm using a Kubernetes cluster on AWS to run Spark jobs with Spark 2.3. Now I want to run spark-submit from an AWS Lambda function against the k8s master. Is there any REST interface for running spark-submit on the k8s master?
Unfortunately, this is not possible with Spark 2.3 if you are using the native Kubernetes support.
Based on the description in the deployment instructions, the submission process consists of several steps:
Spark creates a Spark driver running within a Kubernetes pod.
The driver creates executors, which also run within Kubernetes pods.
The driver connects to them and executes application code.
When the application completes, executor pods terminate and are cleaned up, but the driver pod persists its logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
So, in fact, there is nothing to submit a job to until you start a submission process, which launches the first Spark pod (the driver) for you. Once the application completes, everything is terminated.
Please also see similar answer for this question under the link
