Scheduling Spark Jobs Running on Kubernetes via Airflow - apache-spark

I have a Spark job that runs in a Kubernetes pod. Until now I was using a YAML file to run my jobs manually.
Now I want to schedule my Spark jobs via Airflow.
This is the first time I am using Airflow and I am unable to figure out how I can add my YAML file to Airflow.
From what I have read, I can schedule my jobs via a DAG in Airflow.
A DAG example is this:
from airflow.operators.python_operator import PythonOperator
from airflow.models import DAG
from datetime import datetime, timedelta

args = {
    'owner': 'test',
    'start_date': datetime(2019, 4, 3),
    'retries': 2,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('test_dag', default_args=args, catchup=False)

def print_text1():
    print("Hello-World1")

def print_text():
    print('Hello-World2')

t1 = PythonOperator(task_id='multitask1', python_callable=print_text1, dag=dag)
t2 = PythonOperator(task_id='multitask2', python_callable=print_text, dag=dag)
t1 >> t2
In this case the above methods will get executed one after the other once I trigger the DAG.
Now, in case I want to run a spark-submit job, what should I do?
I am using Spark 2.4.4

Airflow has a concept of operators, which represent Airflow tasks. In your example the PythonOperator is used, which simply executes Python code and is most probably not the one you are interested in, unless you submit your Spark job within Python code. There are several operators that you can make use of:
BashOperator, which executes a given bash script. You may run kubectl or spark-submit with it directly.
SparkSubmitOperator, a dedicated operator for calling spark-submit.
KubernetesPodOperator, which creates a Kubernetes pod for you; you can launch your driver pod directly with it.
Hybrid solutions, e.g. SimpleHttpOperator + Livy on Kubernetes: you spin up a Livy server on Kubernetes, which serves as a Spark job server and provides a REST API to be called from Airflow.
Note: for each of the operators you need to ensure that your Airflow environment contains all the required dependencies for execution as well as the credentials configured to access the required services.
Also, you can refer to this existing thread:
Airflow SparkSubmitOperator - How to spark-submit in another server

Related

PySpark batch job's configuration submitted through Apache Livy have no effect

I submitted a Spark batch job through Livy to the remote cluster with the following request body.
REQUEST_BODY = {
    'file': '/spark/batch/job.py',
    'conf': {
        'spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation': 'true',
        'spark.driver.cores': 1,
        'spark.driver.memory': '12g',
        'spark.executor.cores': 1,
        'spark.executor.memory': '8g',
        'spark.dynamicAllocation.maxExecutors': 4,
    },
}
And in the Python file containing the application to be run, the SparkSession is created like this:
# inside /spark/batch/job.py
spark = SparkSession.builder.getOrCreate()
# Spark application after this point uses the SparkSession created above
The application works just fine, but Spark acquires all of the resources in the cluster, ignoring the configuration set in the request body 😕.
I suspect that /spark/batch/job.py creates another SparkSession apart from the one specified in the Livy request body, but I am not sure how to use the SparkSession provided by Livy. Documentation on this topic is scarce.
Is anyone facing the same issue? How can I solve this problem?
Thanks in advance everyone!

Pyspark job queue config precedence - spark-submit vs SparkSession.builder

I have a shell script which runs a spark-submit command. I want to specify the resource queue name onto which the job runs.
When I use:
spark-submit --queue myQueue job.py (here the job is properly submitted to 'myQueue')
But when I use spark-submit job.py and inside job.py I create a Spark session like:
spark = SparkSession.builder.appName(appName).config("spark.yarn.queue", "myQueue").getOrCreate()
In this case the job runs on the default queue. Also, checking the configs of this running job in the Spark UI shows me that the queue name is "myQueue", but the job still runs on the default queue only.
Can someone explain how I can pass the queue name in the SparkSession.builder config so that it takes effect?
Using PySpark version 2.3

Why Spark executors are not dying

Here is my setup:
A Kubernetes cluster running Airflow, which submits the Spark job to the Kubernetes cluster. The job runs fine, but the containers are supposed to die once the job is done, yet they are still hanging there.
The Airflow setup comes up on the K8s cluster.
The DAG is baked into the Airflow Docker image, because somehow I am not able to sync the DAGs from S3; for some reason the cron won't run.
It submits the Spark job to the K8s cluster and the job runs fine.
But now, instead of dying after execution and completion of the job, it still hangs around.
Here is my SparkSubmitOperator call:
spark_submit_task = SparkSubmitOperator(
    task_id='spark_submit_job_from_airflow',
    conn_id='k8s_spark',
    java_class='com.dom.rom.mainclass',
    application='s3a://some-bucket/jars/demo-jar-with-dependencies.jar',
    application_args=['300000'],
    total_executor_cores='8',
    executor_memory='20g',
    num_executors='9',
    name='mainclass',
    verbose=True,
    driver_memory='10g',
    conf={
        'spark.hadoop.fs.s3a.aws.credentials.provider': 'com.amazonaws.auth.InstanceProfileCredentialsProvider',
        'spark.rpc.message.maxSize': '1024',
        'spark.hadoop.fs.s3a.impl': 'org.apache.hadoop.fs.s3a.S3AFileSystem',
        'spark.kubernetes.container.image': 'dockerhub/spark-image:v0.1',
        'spark.kubernetes.namespace': 'random',
        'spark.kubernetes.container.image.pullPolicy': 'IfNotPresent',
        'spark.kubernetes.authenticate.driver.serviceAccountName': 'airflow-spark',
    },
    dag=dag,
)
Figured out the problem; it was my mistake: I wasn't closing the Spark session. I added the following at the end of the job:
session.stop();

How to launch a Spark job at EMR creation with Terraform

My use case is the following: via Terraform I want to create an EMR cluster, start a Spark job, and terminate the cluster when the job is finished.
I found the step mechanism in the Terraform documentation (https://www.terraform.io/docs/providers/aws/r/emr_cluster.html#step-1), but I didn't find any example for a Spark job.
Maybe I'm doing something wrong, because my use case seems pretty simple, but I can't find another way to do it.
Thanks for your help
I finally found it.
With the step instruction it's possible to launch a Spark job from a jar stored in S3:
step {
  action_on_failure = "TERMINATE_CLUSTER"
  name              = "Launch Spark Job"

  hadoop_jar_step {
    jar  = "command-runner.jar"
    args = ["spark-submit", "--class", "com.mycompany.App", "--master", "yarn", "s3://my_bucket/my_jar_with_dependencies.jar"]
  }
}

How to pull Spark jobs client logs submitted using Apache Livy batches POST method using AirFlow

I am working on submitting Spark jobs using the Apache Livy batches POST method.
This HTTP request is sent using Airflow. After submitting the job, I am tracking its status using the batch id.
I want to show the driver (client) logs in the Airflow logs to avoid going to multiple places (Airflow and Apache Livy/Resource Manager).
Is this possible to do using the Apache Livy REST API?
Livy has endpoints to get logs: /sessions/{sessionId}/log and /batches/{batchId}/log.
Documentation:
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-sessionssessionidlog
https://livy.incubator.apache.org/docs/latest/rest-api.html#get-batchesbatchidlog
You can create Python methods like the ones shown below to get the logs (a sketch of a class body; the hook is stored on self so both methods can share it):
import json
from airflow.hooks.http_hook import HttpHook

self.http = HttpHook("GET", http_conn_id=http_conn_id)

def _http_rest_call(self, method, endpoint, data=None, headers=None, extra_options=None):
    if not extra_options:
        extra_options = {}
    self.http.method = method
    response = self.http.run(endpoint, json.dumps(data), headers, extra_options=extra_options)
    return response

def _get_batch_session_logs(self, batch_id):
    method = "GET"
    endpoint = "batches/" + str(batch_id) + "/log"
    response = self._http_rest_call(method=method, endpoint=endpoint)
    # return response.json()
    return response
Livy exposes its REST API in two ways: sessions and batches. In your case, since we assume you are not using sessions, you are submitting using batches. You can post your batch with a curl command such as:
curl -X POST -H "Content-Type: application/json" -d '{"file": "<path-to-your-application>"}' http://livy-server-IP:8998/batches
Once you have submitted the job, you get the batch id in return. Then you can fetch the logs with:
curl http://livy-server-IP:8998/batches/{batchId}/log
You can find the documentation at:
https://livy.incubator.apache.org/docs/latest/rest-api.html
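Putting the log endpoint to work from Python (a stdlib-only sketch; the host is the same placeholder as in the curl commands, and the response shape, a top-level "log" array, follows the Livy REST API docs linked above):

```python
import json
from urllib.request import urlopen

LIVY_URL = "http://livy-server-IP:8998"  # placeholder host, as in the curl examples

def batch_log_endpoint(batch_id, from_offset=0, size=100):
    """Build the GET /batches/{batchId}/log URL with Livy's paging parameters."""
    return "%s/batches/%s/log?from=%d&size=%d" % (LIVY_URL, batch_id, from_offset, size)

def fetch_batch_logs(batch_id):
    """Fetch and decode the log lines for a submitted batch (network call)."""
    with urlopen(batch_log_endpoint(batch_id)) as resp:
        return json.loads(resp.read().decode("utf-8")).get("log", [])
```

Calling fetch_batch_logs(batch_id) from a PythonOperator after polling the batch state, and simply printing the returned lines, lands the driver logs in the Airflow task log.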
If you want to avoid the above steps, you can use a ready-made AMI (namely, LightningFlow) from the AWS Marketplace, which provides Airflow with a custom Livy operator. The Livy operator submits the job and tracks its status every 30 seconds (configurable), and it also surfaces the Spark logs at the end of the Spark job in the Airflow UI logs.
Note: LightningFlow comes pre-integrated with all required libraries, Livy, custom operators, and a local Spark cluster.
Link for AWS Marketplace:
https://aws.amazon.com/marketplace/pp/Lightning-Analytics-Inc-LightningFlow-Integrated-o/B084BSD66V
This will enable you to view consolidated logs at one place, instead of shuffling between Airflow and EMR/Spark logs (Ambari/Resource Manager).
