I use Airflow to submit multiple hourly Spark jobs to an EMR cluster. In one hour I can have upwards of 30 spark-submits.
The EMR cluster is 1 master node and 4 core nodes, all c4.4xlarge.
My spark submits use master yarn and deploy-mode client.
Every hour multiple Airflow DAGs will ssh into the EMR cluster and spark-submit their jobs. Most of the jobs are small and finish within a few minutes, except for a few that take 10-15 mins.
I have been hitting a recurring error logged by Airflow, and once one task receives it, it cascades to the rest of them:
airflow.exceptions.AirflowException: SSH operator error: No existing session
This means Airflow was unable to ssh into the cluster. I even tried to ssh from my own computer and it just hangs. Is it possible there are too many Spark tasks running? I wouldn't think so, because my cluster is pretty big for the jobs I have to run.
My main intent is to get Spark (3.3) running on k8s with HDFS.
I went through the Spark website and got the Spark Pi program running on a k8s cluster via a spark-submit command. I read that if we submit multiple jobs to a k8s cluster, k8s may end up starving all pods -- meaning there is no queueing in place, no scheduler (like YARN) which keeps a check on resources and arranges the tasks across the nodes.
So, my question is: what is the simplest way to write a scheduler in k8s? I read about Volcano -- but it's not yet GA. I read about gang scheduling and YuniKorn -- but I don't see much community support.
I'm using an Azure HDInsight cluster, and through the Jupyter interface included with HDI I can run interactive Spark queries, but I was wondering how to run long-running jobs. E.g. if during my interactive querying I realize I want to do some long-running job that will take a few hours, is there a way to run a command from PySpark itself, e.g. read data from path x, do some transformation, and save in path y?
Currently, if I just try to do that job inside the PySpark session itself and leave it running, Livy will eventually time out and kill the job. Is there some command to submit the batch job and get an ID I can query later to get the job status?
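Not part of the original question, but a minimal sketch of one way this can be done against Livy's batch REST API (POST /batches, then poll /batches/<id>/state). The cluster URL, credentials, storage path, and script name below are placeholders, and the gateway/auth details may differ on your HDInsight cluster:

import requests

# Assumption: HDInsight exposes Livy behind the cluster gateway with basic auth.
LIVY_URL = "https://<clustername>.azurehdinsight.net/livy"
AUTH = ("admin", "<cluster-login-password>")

# Submit a batch job; my_job.py is a placeholder PySpark script that must be
# reachable by the cluster (e.g. uploaded to its default storage account).
resp = requests.post(
    f"{LIVY_URL}/batches",
    auth=AUTH,
    json={
        "file": "wasbs:///jobs/my_job.py",
        "args": ["--input", "path_x", "--output", "path_y"],
        "conf": {"spark.executor.instances": "4"},
    },
)
batch_id = resp.json()["id"]

# Later, poll the batch state using the returned id (e.g. "running", "success", "dead").
state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state", auth=AUTH).json()["state"]
print(batch_id, state)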
I'm trying to set up a DAG that will create a Spark cluster in the first task, submit Spark applications to the cluster in interim tasks, and finally tear down the Spark cluster in the last task.
The approach I'm attempting right now is to use KubernetesPodOperators to create Spark Master and Worker pods. The issue is that they run a Spark daemon which never exits. The fact that the command called on the pod never exits means that those tasks get stuck in Airflow in a running state. So, I'm wondering if there's a way to run the Spark daemon and then continue on to the next tasks in the DAG?
Apache Spark has built-in support for executing jobs on a Kubernetes cluster: it runs a driver that starts executors in pods to do the work.
You don't need to create Master and Worker pods directly in Airflow.
Rather, build a Docker image containing Apache Spark with the Kubernetes backend.
An example Dockerfile is provided in the Spark project.
Then submit the jobs to the cluster from a container based on this image, using the KubernetesPodOperator. The following sample task is adapted from the documentation provided with Apache Spark for submitting Spark jobs directly to a Kubernetes cluster.
# Airflow 2.x with the apache-airflow-providers-cncf-kubernetes provider installed
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

kubernetes_full_pod = KubernetesPodOperator(
    task_id='spark-job-task-ex',
    name='spark-job-task',
    namespace='default',
    image='<prebuilt-spark-image-name>',
    cmds=['bin/spark-submit'],
    # Each flag and its value must be separate list items, since every
    # item becomes its own argv entry for spark-submit in the container.
    arguments=[
        '--master', 'k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>',
        '--deploy-mode', 'cluster',
        '--name', 'spark-pi',
        '--class', 'org.apache.spark.examples.SparkPi',
        '--conf', 'spark.executor.instances=5',
        '--conf', 'spark.kubernetes.container.image=<prebuilt-spark-image-name>',
        'local:///path/to/examples.jar',
    ],
    # ...
)
I have a Spark job that periodically hangs, leaving my AWS EMR cluster in a state where an application is RUNNING but the cluster is really stuck. I know that if my job doesn't get stuck, it'll finish in 5 hours or less. If it's still running after that, it's a sign that the job is stuck. YARN and the Spark UI are still responsive; it's just that an executor gets stuck on a task.
Background: I'm using an ephemeral EMR cluster that performs only one step before terminating, so it's not a problem to kill it off if I notice this job is hanging.
What's the easiest way to kill the task, job, or cluster in this case? Ideally this would not involve setting up some extra service to monitor the job -- there would just be some kind of Spark / YARN / EMR setting I could use.
Note: I've tried using spark speculation to unblock the stuck spark job, but that doesn't help.
EMR has a Bootstrap Actions feature where you can run scripts that start up when initializing the cluster. I've used this feature along with a startup script that monitors how long the cluster has been online and terminates it after a certain time.
I use a script based on this one for the bootstrap action: https://github.com/thomhopmans/themarketingtechnologist/blob/master/6_deploy_spark_cluster_on_aws/files/terminate_idle_cluster.sh
Basically, make a script that checks /proc/uptime to see how long the EC2 machine has been online, and once uptime surpasses your time limit, send a shutdown command to the cluster.
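For illustration only (this is not the linked script), a minimal Python sketch of that idea, assuming it is copied to the master node and launched in the background by the bootstrap action (e.g. with nohup ... &, since bootstrap actions themselves must finish); the 6-hour limit is an arbitrary example:

#!/usr/bin/env python3
# Hypothetical watchdog: shut the master node down once total uptime
# exceeds a limit, which terminates the (single-master) EMR cluster.
import subprocess
import time

MAX_UPTIME_SECONDS = 6 * 60 * 60  # example limit; tune to your job's expected runtime

while True:
    # /proc/uptime holds the machine's uptime in seconds as the first field.
    with open("/proc/uptime") as f:
        uptime_seconds = float(f.read().split()[0])
    if uptime_seconds > MAX_UPTIME_SECONDS:
        subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)
        break
    time.sleep(60)  # re-check every minute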
Is there a Spark config parameter we can pass while submitting jobs through spark-submit that will kill/fail the job if it does not get containers within a given time?
For example, if a job requests 8 YARN containers which cannot be allocated for 2 hours, then the job should kill itself.
EDIT: We have scripts launching Spark or MR jobs on a cluster. This issue is not a major one for MR jobs, as they can start even if only 1 container is available. MR jobs also need much less memory, so their containers can be smaller, and hence more containers are available in the cluster.