What's the most elegant/right way to stop a spark job running on a Kubernetes cluster? - apache-spark

I'm new to Apache Spark and I'm trying to run a Spark job using spark-submit on my Kubernetes cluster. I was wondering if there's a right way to stop Spark jobs once the driver and executor pods are spawned? Would deleting the pods themselves be enough?
Thanks!

If you delete an executor pod, it will simply be recreated and the Spark application will keep running. However, if you delete the driver pod, the application stops.
So killing the driver pod is actually the way to stop a Spark
application while it is executing.
Since you are new to Spark and want to run it on Kubernetes, you should check this tutorial.
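Concretely, stopping the application comes down to deleting the driver pod with kubectl; a minimal sketch, with a hypothetical driver pod name:

kubectl get pods                      # find the driver pod, e.g. spark-pi-driver
kubectl delete pod spark-pi-driver    # stops the application; its executor pods go with it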

At present the only way to stop a Spark job running on Kubernetes is to delete the driver pod (unless you have an app controlling the Spark context which is able to stop it gracefully). Since all other job-related resources are linked to the Spark driver pod via so-called ownerReferences, they will be removed automatically by Kubernetes when the driver pod is deleted.
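You can see that ownership yourself; a small sketch with a hypothetical executor pod name, printing which object owns it:

kubectl get pod spark-pi-exec-1 -o jsonpath='{.metadata.ownerReferences[*].name}'   # prints the driver pod's name

Because of that owner reference, deleting the driver cascades to the executor pods.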

It should clean things up automatically when the job completes.
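If you want to check or tidy up finished runs yourself, a small sketch (the spark-role label is what Spark on Kubernetes puts on its pods; the pod name is hypothetical):

kubectl get pods -l spark-role=driver    # finished drivers remain in Completed state
kubectl delete pod spark-pi-driver       # remove a completed driver and anything it still owns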

Related

How to run multiple spark jobs on k8s cluster with simple scheduler

My main intent is to get Spark (3.3) running on k8s with HDFS.
I went through the Spark website and got the Spark Pi example running on a k8s cluster via a spark-submit command. I read that if we submit multiple jobs to the k8s cluster, Kubernetes may end up starving pods -- meaning there is no queueing in place and no scheduler (like YARN) which keeps a check on resources and arranges the tasks across the nodes.
So, my question is: what is the simplest way to write a scheduler in k8s? I read about Volcano -- but it's not yet GA. I read about gang scheduling and YuniKorn -- but I don't see much community support.

Why is there a Spark process on my cluster even before I have started my SparkSession?

I have set up a Dataproc cluster to run my Spark jobs on. I have just set up the cluster and have not started any Spark session yet. Still, I am seeing Spark, MapReduce, YARN, etc. processes in my top output. What is that about? Shouldn't the Spark processes start only after I have started the SparkSession with the configuration of my choice?
These are all background daemons that run and monitor the Hadoop and Spark ecosystem, waiting for you to submit a request or program to run. They need to be up and running before you can run a Spark app. This is pretty normal on Linux.
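A quick way to see which daemons are behind those entries in top (the process names here are typical for a Dataproc node; yours may differ):

ps -ef | grep -iE 'namenode|resourcemanager|nodemanager|historyserver'    # the Hadoop/YARN daemons started with the cluster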

Airflow - How to run a KubernetesPodOperator with a non-exiting command

I'm trying to set up a DAG that will create a Spark cluster in the first task, submit Spark applications to the cluster in the intermediate tasks, and finally tear down the Spark cluster in the last task.
The approach I'm attempting right now is to use KubernetesPodOperators to create Spark master and worker pods. The issue is that they run a Spark daemon which never exits. Because the command called on the pod never exits, those tasks get stuck in Airflow in a running state. So, I'm wondering if there's a way to run the Spark daemon and then continue on to the next tasks in the DAG?
Apache Spark has built-in support for running jobs on a Kubernetes cluster: spark-submit launches a driver pod, which in turn starts executor pods to run the job.
You don't need to create master and worker pods directly in Airflow.
Instead, build a Docker image containing Apache Spark with the Kubernetes backend; an example Dockerfile is provided in the Spark project.
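In recent Spark releases the image can be built and pushed with the script that ships with the distribution (the registry name and tag below are placeholders):

./bin/docker-image-tool.sh -r <your-registry> -t my-spark-tag build
./bin/docker-image-tool.sh -r <your-registry> -t my-spark-tag push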
Then submit jobs to the cluster from a container based on this image, using a KubernetesPodOperator. The following sample task is adapted from the Apache Spark documentation for submitting Spark jobs directly to a Kubernetes cluster.
from airflow.operators.kubernetes_pod_operator import KubernetesPodOperator

# Runs spark-submit inside a pod built from the prebuilt Spark image; that
# spark-submit then creates the driver and executor pods for the job.
kubernetes_full_pod = KubernetesPodOperator(
    task_id='spark-job-task-ex',
    name='spark-job-task',
    namespace='default',
    image='<prebuilt-spark-image-name>',
    cmds=['bin/spark-submit'],
    # Each flag and its value must be a separate list element, since every
    # element is passed to spark-submit as its own argument.
    arguments=[
        '--master', 'k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>',
        '--deploy-mode', 'cluster',
        '--name', 'spark-pi',
        '--class', 'org.apache.spark.examples.SparkPi',
        '--conf', 'spark.executor.instances=5',
        '--conf', 'spark.kubernetes.container.image=<prebuilt-spark-image-name>',
        'local:///path/to/examples.jar',
    ],
    # ...
)
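Note that each flag and its value are separate list elements: the KubernetesPodOperator passes every element of arguments to the container command as its own argument, and spark-submit will not parse a flag and its value glued together in one string.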

REST API for Spark 2.3 submit on a Kubernetes (version 1.8.*) cluster

I'm using a Kubernetes cluster on AWS to run Spark jobs with Spark 2.3. Now I want to run spark-submit from an AWS Lambda function against the k8s master. I would like to know if there is any REST interface to run spark-submit on the k8s master.
Unfortunately, it is not possible with Spark 2.3 if you are using native Kubernetes support.
Based on the description in the deployment instructions, the submission process consists of several steps:
Spark creates a Spark driver running within a Kubernetes pod.
The driver creates executors, which also run within Kubernetes pods.
The driver connects to them and executes application code.
When the application completes, executor pods terminate and are cleaned up, but the driver pod persists its logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
So, in fact, you have no endpoint to submit a job to until you start the submission process yourself, which launches the first of Spark's pods (the driver) for you. Once the application completes, everything is terminated.
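For reference, the submission path supported in Spark 2.3 is plain spark-submit, which you could wrap in your own service; a minimal sketch, with placeholder API-server address and image name:

bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///path/to/examples.jar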
Please also see the similar answer to this question under the link.

How to start, stop and re-start cluster manually in Bluemix Spark?

What is the simplest way to start and stop Spark clusters manually in Bluemix? I would basically want to run the same things that
sbin/start-all.sh
and
sbin/stop-all.sh
do on a standalone Spark installation.
You can't. The Apache Spark service in Bluemix runs clusters that are shared by many users. No user is allowed to shut down or start these clusters.
