Why are Spark executors not dying - apache-spark

Here is my setup:
A Kubernetes cluster runs Airflow, which submits the Spark job to the Kubernetes cluster. The job runs fine, but the containers are supposed to die once the job is done, and instead they are still hanging around.
The Airflow setup comes up on the K8s cluster.
The DAG is baked into the Airflow Docker image, because I have not been able to sync the DAGs from S3; for some reason the cron job won't run.
Airflow submits the Spark job to the K8s cluster and the job runs fine.
But instead of dying once the job has executed and completed, the containers still hang around.
Here is my SparkSubmitOperator task:
spark_submit_task = SparkSubmitOperator(
    task_id='spark_submit_job_from_airflow',
    conn_id='k8s_spark',
    java_class='com.dom.rom.mainclass',
    application='s3a://some-bucket/jars/demo-jar-with-dependencies.jar',
    application_args=['300000'],
    total_executor_cores='8',
    executor_memory='20g',
    num_executors='9',
    name='mainclass',
    verbose=True,
    driver_memory='10g',
    conf={
        'spark.hadoop.fs.s3a.aws.credentials.provider': 'com.amazonaws.auth.InstanceProfileCredentialsProvider',
        'spark.rpc.message.maxSize': '1024',
        'spark.hadoop.fs.s3a.impl': 'org.apache.hadoop.fs.s3a.S3AFileSystem',
        'spark.kubernetes.container.image': 'dockerhub/spark-image:v0.1',
        'spark.kubernetes.namespace': 'random',
        'spark.kubernetes.container.image.pullPolicy': 'IfNotPresent',
        'spark.kubernetes.authenticate.driver.serviceAccountName': 'airflow-spark'
    },
    dag=dag,
)

I figured out the problem, and it was my mistake: I wasn't closing the Spark session. I added the following:
session.stop();
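One way to make sure stop() is never forgotten is to call it on every exit path, including when the job raises. The following is only a minimal Python sketch of that pattern, not the code from the job above: stopping is a hypothetical helper, and FakeSession is a hypothetical stand-in for a Spark session so that the snippet runs without Spark installed.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield the session and always call its stop() method afterwards."""
    try:
        yield session
    finally:
        # Runs on normal completion and when the body raises.
        session.stop()


class FakeSession:
    """Hypothetical stand-in for a Spark session (assumption, not a Spark class)."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


session = FakeSession()
try:
    with stopping(session):
        raise RuntimeError("job failed")  # simulate a failing job
except RuntimeError:
    pass

print(session.stopped)  # True: stop() ran even though the job raised
```

With a real session the body of the with block would hold the job logic, and the finally clause would play the role of the session.stop() call added above.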

Related

Pyspark job queue config precedence - spark-submit vs SparkSession.builder

I have a shell script that runs a spark-submit command. I want to specify the name of the resource queue that the job runs on.
When I use:
spark-submit --queue myQueue job.py
the job is properly submitted to 'myQueue'.
But when I use spark-submit job.py and create the Spark session inside job.py like this:
spark=SparkSession.builder.appName(appName).config("spark.yarn.queue", "myQueue")
the job runs on the default queue. When I check the configuration of the running job in the Spark UI, it shows the queue name as "myQueue", yet the job still runs on the default queue.
Can someone explain how I can pass the queue name in the SparkSession.builder configs so that it takes effect?
I am using PySpark version 2.3.

Scheduling Spark Jobs Running on Kubernetes via Airflow

I have a Spark job that runs via a Kubernetes pod. Until now I was using a YAML file to run my jobs manually.
Now I want to schedule my Spark jobs via Airflow.
This is the first time I am using Airflow, and I am unable to figure out how to add my YAML file to Airflow.
From what I have read, I can schedule my jobs via a DAG in Airflow.
An example DAG is this:
from airflow.operators import PythonOperator
from airflow.models import DAG
from datetime import datetime, timedelta

args = {'owner': 'test', 'start_date': datetime(2019, 4, 3), 'retries': 2, 'retry_delay': timedelta(minutes=1)}
dag = DAG('test_dag', default_args=args, catchup=False)

def print_text1():
    print("Hello-World1")

def print_text():
    print('Hello-World2')

# Each task wraps one of the functions above
t1 = PythonOperator(task_id='multitask1', python_callable=print_text1, dag=dag)
t2 = PythonOperator(task_id='multitask2', python_callable=print_text, dag=dag)
# Run t1 first, then t2
t1 >> t2
In this case the above functions are executed one after the other once I trigger the DAG.
Now, if I want to run a spark-submit job, what should I do?
I am using Spark 2.4.4.
Airflow has a concept of operators, which represent Airflow tasks. Your example uses PythonOperator, which simply executes Python code and is most probably not the one you are interested in, unless you submit the Spark job from within Python code. There are several operators that you can make use of:
BashOperator, which executes the given bash script for you. You can run kubectl or spark-submit with it directly.
SparkSubmitOperator, the dedicated operator for calling spark-submit.
KubernetesPodOperator, which creates a Kubernetes pod for you. You can launch your driver pod directly with it.
Hybrid solutions, e.g. HttpOperator + Livy on Kubernetes: you spin up a Livy server on Kubernetes, which serves as a Spark job server and provides a REST API that the Airflow HttpOperator can call.
Note: for each of these operators you need to ensure that your Airflow environment contains all the dependencies required for execution, as well as the credentials needed to access the required services.
You can also refer to this existing thread:
Airflow SparkSubmitOperator - How to spark-submit in another server
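For the BashOperator route, the task ultimately needs a spark-submit command line. The following is a minimal sketch of assembling one in plain Python, not part of Airflow or Spark: build_spark_submit is a hypothetical helper, and the master URL, class name, image and jar path are placeholder assumptions. Only standard spark-submit options (--master, --deploy-mode, --class, --name, --conf) are used.

```python
import shlex


def build_spark_submit(application, main_class=None, master=None,
                       deploy_mode=None, name=None, conf=None, app_args=()):
    """Return a spark-submit command as a list of arguments."""
    cmd = ["spark-submit"]
    if master:
        cmd += ["--master", master]
    if deploy_mode:
        cmd += ["--deploy-mode", deploy_mode]
    if main_class:
        cmd += ["--class", main_class]
    if name:
        cmd += ["--name", name]
    # Each Spark property becomes one "--conf key=value" pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(application)
    cmd += [str(arg) for arg in app_args]
    return cmd


command = build_spark_submit(
    application="s3a://some-bucket/jars/demo-jar-with-dependencies.jar",
    main_class="com.example.Main",                       # placeholder class
    master="k8s://https://kubernetes.example.com:6443",  # placeholder URL
    deploy_mode="cluster",
    name="demo",
    conf={"spark.kubernetes.container.image": "example/spark-image:v0.1"},
    app_args=[300000],
)

# Quote every part so the result can be used as a shell command string.
bash_command = " ".join(shlex.quote(part) for part in command)
print(bash_command)
```

The resulting string could then be used as the bash_command of a BashOperator, assuming spark-submit and the cluster credentials are available in the Airflow worker environment.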

How to Launch a Spark Job in EMR creation with terraform

My use case is the following: via Terraform I want to create an EMR cluster, start a Spark job, and terminate the cluster when the job is finished.
I found this step mechanism in Terraform documentation (https://www.terraform.io/docs/providers/aws/r/emr_cluster.html#step-1) but I didn't find any example for a Spark Job on Google (an
Maybe I'm doing something wrong, because my use case seems pretty simple, but I can't find another way to do it.
Thanks for your help.
I finally found it.
With the step instruction it's possible to launch a Spark job from a JAR stored in S3:
step {
  action_on_failure = "TERMINATE_CLUSTER"
  name = "Launch Spark Job"
  hadoop_jar_step {
    jar = "command-runner.jar"
    args = ["spark-submit","--class","com.mycompany.App","--master","yarn","s3://my_bucket/my_jar_with_dependencies.jar"]
  }
}

Unable to gracefully finish an Airflow DAG

I have a spark-streaming job that runs on EMR, scheduled by Airflow. We want to gracefully terminate this EMR cluster every week.
But when I send a kill or SIGTERM signal to the running spark-streaming application, Airflow reports the task as "failed" in the DAG. This prevents the DAG from moving further and prevents the next run from triggering.
Is there any way either to kill the running spark-streaming app so that the task is marked as a success, or to let the DAG complete even though it sees the task as failed?
For the first part, can you share your code that kills the Spark app? I think you should be able to have this task return success and have everything downstream "just work".
I'm not too familiar with EMR, but looking at the docs it looks like "job flow" is their name for the Spark cluster. In that case, are you using the built-in EmrTerminateJobFlowOperator?
I wonder whether the task fails because the terminating cluster propagates an error code back. It is also possible that the cluster fails to terminate and your code raises an exception, which leads to a failed task.
To answer the second part, if you have multiple upstream tasks, you can use an alternate trigger rule on the operator to determine which downstream tasks run.
class TriggerRule(object):
    ALL_SUCCESS = 'all_success'
    ALL_FAILED = 'all_failed'
    ALL_DONE = 'all_done'
    ONE_SUCCESS = 'one_success'
    ONE_FAILED = 'one_failed'
    DUMMY = 'dummy'
https://github.com/apache/incubator-airflow/blob/master/airflow/utils/trigger_rule.py
https://github.com/apache/incubator-airflow/blob/master/docs/concepts.rst#trigger-rules
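To illustrate how these rules differ, here is a simplified model in plain Python. It is not Airflow's scheduler code: should_run is a hypothetical helper, and upstream task states are reduced to the strings 'success', 'failed' and 'running' for the sake of the example.

```python
def should_run(trigger_rule, upstream_states):
    """Decide whether a task may run, given the states of its upstream tasks."""
    succeeded = [state == "success" for state in upstream_states]
    failed = [state == "failed" for state in upstream_states]
    done = [state in ("success", "failed") for state in upstream_states]

    if trigger_rule == "all_success":
        return all(succeeded)
    if trigger_rule == "all_failed":
        return all(failed)
    if trigger_rule == "all_done":
        return all(done)
    if trigger_rule == "one_success":
        return any(succeeded)
    if trigger_rule == "one_failed":
        return any(failed)
    if trigger_rule == "dummy":
        return True  # dependencies are only for show
    raise ValueError(f"unknown trigger rule: {trigger_rule}")


# The streaming task was killed and is reported as failed.
upstream = ["failed"]
print(should_run("all_success", upstream))  # False: downstream is blocked
print(should_run("all_done", upstream))     # True: downstream still runs
```

In this model, a downstream task with the all_done rule runs once the killed streaming task has finished, whether it ended as a success or a failure.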

Spark 2.0 Standalone mode Dynamic Resource Allocation Worker Launch Error

I'm running Spark 2.0 in standalone mode. I successfully configured it to launch on a server and was also able to configure the IPython PySpark kernel as an option in Jupyter Notebook. Everything works fine, but for each notebook that I launch, all 4 of my workers are assigned to that application. So if another person from my team tries to launch another notebook with the PySpark kernel, it simply does not work until I stop the first notebook and release all the workers.
To solve this problem I'm trying to follow the instructions from the Spark 2.0 documentation.
So, in my $SPARK_HOME/conf/spark-defaults.conf I have the following lines:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.executorIdleTimeout 10
Also, on $SPARK_HOME/conf/spark-env.sh I have:
export SPARK_WORKER_MEMORY=1g
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=1
But when I try to launch the workers using $SPARK_HOME/sbin/start-slaves.sh, only the first worker is launched successfully. The log from the first worker ends like this:
16/11/24 13:32:06 INFO Worker: Successfully registered with master spark://cerberus:7077
But the logs from workers 2-4 show this error:
INFO ExternalShuffleService: Starting shuffle service on port 7337 with useSasl = false
16/11/24 13:32:08 ERROR Inbox: Ignoring error
java.net.BindException: Address already in use
It seems (to me) that the first worker successfully starts the shuffle service on port 7337, but workers 2-4 "do not know" about this and try to launch another shuffle service on the same port.
The problem also occurs for all workers (1-4) if I first launch a shuffle service (using $SPARK_HOME/sbin/start-shuffle-service.sh) and then try to launch all the workers ($SPARK_HOME/sbin/start-slaves.sh).
Is there any way to get around this, so that all workers can verify whether a shuffle service is already running and connect to it instead of trying to create a new one?
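The BindException in the worker logs is the generic error raised when a second process tries to listen on a port that is already taken. Here is a minimal sketch of that behaviour using plain Python sockets. It is not Spark code: try_bind is a hypothetical helper, and the port is chosen by the operating system rather than being 7337.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Bind and listen on (host, port); return the socket, or None if the port is in use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None  # same condition as "Address already in use"
        raise


# The first "service" gets a free port from the OS (port 0 means "pick one").
first = try_bind(0)
port = first.getsockname()[1]

# A second "service" on the same port cannot bind while the first is listening.
second = try_bind(port)
print("second bind succeeded:", second is not None)

first.close()
```

This mirrors the situation described above, where each of the four workers on the same host tries to start its own shuffle service on the same port.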
I had the same issue and seemed to get it working by removing the spark.shuffle.service.enabled item from the config file (in fact I don't have any dynamicAllocation-related items in there) and instead putting this in the SparkConf when I request a SparkContext:
import pyspark

# Enable dynamic allocation and the external shuffle service per application
sconf = pyspark.SparkConf() \
    .setAppName("sc1") \
    .set("spark.dynamicAllocation.enabled", "true") \
    .set("spark.shuffle.service.enabled", "true")
sc1 = pyspark.SparkContext(conf=sconf)
I start the master and slaves as normal:
$SPARK_HOME/sbin/start-all.sh
And I have to start one instance of the shuffle service:
$SPARK_HOME/sbin/start-shuffle-service.sh
Then I started two notebooks with this context and had each of them run a small job. The first notebook's application runs its job and is in the RUNNING state, while the second notebook's application is in the WAITING state. After a minute (the default idle timeout), the resources are reallocated and the second context gets to run its job (and both are in the RUNNING state).
Hope this helps,
John
