Using airflow to run spark streaming jobs? - apache-spark

Our Hadoop cluster runs both Spark batch jobs and Spark Streaming jobs.
We would like to schedule and manage them both on the same platform.
We came across Airflow, which fits our need for a
"platform to author, schedule, and monitor workflows".
I just want to be able to stop and start a Spark Streaming job. Using Airflow's graphs and profiling is less of an issue.
My question is,
Besides losing some functionality (graphs, profiling), why shouldn't I use Airflow to run Spark Streaming jobs?
I came across this question:
Can airflow be used to run a never ending task?
which says that it's possible, but not why you shouldn't do it.

@mMorozonv's answer looks good. You could have one DAG start the stream if it does not exist, then a second DAG act as a health checker to track its progress. If the health check fails, you could trigger the first DAG again.
Alternatively, you can run the stream with a one-time trigger (trigger once) [1].
# Load your Streaming DataFrame
sdf = spark.readStream.load(path="data/", format="json", schema=my_schema)
# Perform transformations and then write…
sdf.writeStream.trigger(once=True).start(path="/out/path", format="parquet")
This gives you all the same benefits of Spark streaming, with the flexibility of batch processing.
You can simply point the stream at your data, and the job will detect all the new files since the last iteration (using checkpointing), run a streaming batch, then terminate. You could set your Airflow DAG's schedule to suit whatever lag you'd like to process data at (every minute, hour, etc.).
I wouldn't recommend this for low-latency requirements, but it's very suitable to be run every minute.
[1] https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
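A slightly fuller sketch of the trigger-once job, assuming the same JSON source and Parquet sink as the snippet above. The function name, its parameters and the explicit checkpoint path are illustrative assumptions, not part of the original answer. The checkpointLocation option is what lets the next run pick up only the new files, and awaitTermination() blocks until the single batch is done, so the calling Airflow task reports the real outcome.

```python
def run_incremental_batch(spark, source_path, sink_path, checkpoint_path, schema):
    """Process the files that arrived since the previous run, then stop.

    `spark` is an existing SparkSession; all paths are placeholders.
    """
    sdf = (
        spark.readStream
        .format("json")
        .schema(schema)
        .load(source_path)
    )
    # Transformations on `sdf` would go here.
    query = (
        sdf.writeStream
        .format("parquet")
        .option("checkpointLocation", checkpoint_path)  # remembers processed files
        .trigger(once=True)  # one micro-batch, then the query stops
        .start(sink_path)
    )
    query.awaitTermination()  # block until the batch has finished
    return query
```

The script that wraps this function can then be submitted by a scheduled Airflow task like any other batch job.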

Using Airflow's branching functionality, we can have one DAG that does both the scheduling and the monitoring of our streaming job. The DAG checks the status of the application and, if the application is not running, submits the streaming job. Otherwise the DAG run can simply finish, or you can add a sensor that checks the streaming job's status after some time, with alerts and whatever else you need.
There are two main problems:
1. Submitting the streaming application without waiting for it to finish. Otherwise our operator will keep running until it reaches execution_timeout.
That problem can be solved by submitting our streaming job in cluster mode with the spark.yarn.submit.waitAppCompletion configuration parameter set to false.
2. Checking the status of our streaming application.
We can check the streaming application's status using YARN. For example, we can use the command yarn application -list -appStates RUNNING. If our application is in the list of running applications, we should not trigger our streaming job. The only requirement is that the streaming job's name is unique.
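The two steps above can be sketched as plain Python helpers that a branching task could call. This is a minimal sketch under assumptions: the function names and the task ids "submit_stream" and "stream_already_running" are hypothetical, and the parser assumes that the YARN CLI prints tab-separated rows that start with the application id and carry the application name in the second column.

```python
import subprocess


def build_submit_command(app_name, app_file):
    """spark-submit arguments that return as soon as YARN accepts the app."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--name", app_name,  # must be unique so the status check can find it
        "--conf", "spark.yarn.submit.waitAppCompletion=false",
        app_file,
    ]


def running_app_names(yarn_list_output):
    """Extract application names from `yarn application -list` output."""
    names = set()
    for line in yarn_list_output.splitlines():
        if line.strip().startswith("application_"):
            columns = [column.strip() for column in line.split("\t")]
            if len(columns) > 1:
                names.add(columns[1])
    return names


def choose_branch(app_name, yarn_list_output):
    """Return the id of the task to run next (usable as a branch callable)."""
    if app_name in running_app_names(yarn_list_output):
        return "stream_already_running"
    return "submit_stream"


def list_running_apps():
    """Run the YARN CLI; requires `yarn` on the PATH of the Airflow worker."""
    result = subprocess.run(
        ["yarn", "application", "-list", "-appStates", "RUNNING"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

In a DAG, choose_branch("my-stream", list_running_apps()) could serve as the body of the branching callable, with the submit task running build_submit_command.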

There are no strict reasons why you shouldn't use Airflow to run a Spark Streaming job. In fact, you can monitor your process by periodically logging some metrics with
LOG.info(query.lastProgress)
LOG.info(query.status)
and see them in the task log.
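A minimal sketch of such periodic logging, assuming `query` is the StreamingQuery returned by start(). The function name, the logger name and the max_polls and sleep parameters are illustrative; the last two exist only to make the loop easy to bound and test.

```python
import logging
import time

LOG = logging.getLogger("stream_monitor")


def log_progress(query, interval_seconds=60.0, max_polls=None, sleep=time.sleep):
    """Log a streaming query's progress until it stops.

    Returns the number of polls that were logged.
    """
    polls = 0
    while query.isActive and (max_polls is None or polls < max_polls):
        LOG.info("progress: %s", query.lastProgress)
        LOG.info("status: %s", query.status)
        polls += 1
        sleep(interval_seconds)
    return polls
```

Called from the driver after the query starts, this writes one progress entry per interval to the driver log, which is what ends up in the Airflow task log when the task waits on the job.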

Related

How does DAG make Apache Spark fault-tolerant?

I am a beginner with Apache Spark. I was trying to understand the concept of the DAG, which Apache Spark builds as we apply transformations one after another and which is executed once an action is performed.
What I could make out is that in the event of a job failure, the DAG comes to the rescue. Since all the intermediate RDDs are stored in memory, Spark knows up to which step the job ran successfully and restarts the job from that point only, instead of starting the job from the beginning.
Now I have several questions here:
Can the DAG make Spark resilient to node failures?
Is it the driver node which maintains the DAG?
Can there be multiple DAGs for a single execution?
What I could make out is that in the event of a job failure, the DAG comes to the rescue. Since all the intermediate RDDs are stored in memory, Spark knows up to which step the job ran successfully and restarts the job from that point only, instead of starting the job from the beginning.
I think what you have said above, based on your understanding, is not fully correct.
The DAG comes to the rescue in the event of node failures, not job failures.
The Spark driver knows, through the cluster manager, which worker node is working on which partition of the data. So when the cluster manager learns that a specific node is dead, it assigns another node to take over the processing. Because of the DAG, the new worker node knows which tasks it has to work on, but it has to perform all the transformations from the start. If a node fails, everything it held in memory goes away as well. The DAG helps Spark to be fault tolerant because it lets Spark recover from node failures.
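The recovery idea can be illustrated with a toy model in plain Python. This is only a sketch of the concept and not how Spark is implemented: the ToyRDD class and its methods are invented for the illustration, and the "DAG" here is just a linear list of recorded functions.

```python
class ToyRDD:
    """Toy stand-in for an RDD: source partitions plus recorded transformations."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions
        self.lineage = tuple(lineage)  # ordered functions, a linear "DAG"
        self.cache = {}  # partition index -> computed data ("node memory")

    def map(self, fn):
        # Transformations are lazy: only the lineage grows.
        return ToyRDD(self.source_partitions, self.lineage + (fn,))

    def compute_partition(self, index):
        if index not in self.cache:
            data = list(self.source_partitions[index])
            for fn in self.lineage:  # replay every step from the source
                data = [fn(x) for x in data]
            self.cache[index] = data
        return self.cache[index]

    def collect(self):
        # The action: materialise every partition.
        result = []
        for index in range(len(self.source_partitions)):
            result.extend(self.compute_partition(index))
        return result

    def lose_partition(self, index):
        # Simulate a node failure wiping one in-memory partition.
        self.cache.pop(index, None)


rdd = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x + 1).map(lambda x: x * 10)
first = rdd.collect()
rdd.lose_partition(1)      # the "node" holding partition 1 dies
second = rdd.collect()     # partition 1 is rebuilt from the recorded steps
```

Only the lost partition is recomputed, and it is recomputed from the source by replaying every recorded transformation, which matches the "from the start" point made above.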
Your Question 1:
Can the DAG make Spark resilient to node failures?
Yes, the DAG makes Spark fault tolerant to node failures.
Question 2:
Is it the driver node which maintains the DAG?
Yes. When an action is called, the created DAG is submitted to the DAG Scheduler, where it gets split into stages of tasks.
Question 3:
Can there be multiple DAGs for a single execution?
No. You cannot have multiple DAGs, because the DAG is a graph that represents the operations that you perform.

How to use Airflow to restart a failed structured streaming spark job?

I need to run a Structured Streaming Spark job on AWS EMR. As a resilience requirement, if the Spark job fails for some reason, we want it to be recreated on EMR. This is similar to task orchestration in ECS, which can restart a task if its health check fails. However, EMR is more of a compute engine than an orchestration system.
I am looking for a big data workflow orchestration tool, such as Airflow. However, it does not support cycles in a DAG. How can I implement something like the following?
step_adder (EmrAddStepsOperator) >> step_checker (EmrStepSensor) >> step_adder (EmrAddStepsOperator).
What is the suggested way to improve such job-level resilience? Any comments are welcome!
Some of the resilience is already covered by Apache Spark (jobs submitted with spark-submit). However, when you want to interact with other processes that are not within Spark, Airflow might be a solution. In your case, a Sensor can help detect whether a certain condition has occurred, and based on that you can decide what to do in the DAG. Here is a simple HttpSensor that waits for a batch job and checks whether it has finished successfully:
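# Airflow 2.x import path; older releases used airflow.sensors.http_sensor.
from airflow.providers.http.sensors.http import HttpSensor

# check_spark_status and dag are assumed to be defined elsewhere in the DAG file.
wait_batch_to_finish = HttpSensor(
    http_conn_id='spark_web',
    task_id="wait_batch_to_finish",
    method="GET",
    headers={"Content-Type": "application/json"},
    endpoint="/json",
    response_check=lambda response: check_spark_status(response, "{{ ti.xcom_pull('batch_intel_task')}}"),
    poke_interval=60,
    dag=dag
)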
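The snippet above calls check_spark_status without showing it. Here is a hypothetical sketch of what such a helper might look like. It assumes the endpoint returns JSON with a "completedapps" list whose entries carry "id" and "state" fields; that payload shape is an assumption, not something the answer states, so adjust it to whatever your endpoint actually returns.

```python
def check_spark_status(response, app_id):
    """Return True once `app_id` is reported as finished.

    Assumed payload shape (not confirmed by the answer):
    {"activeapps": [...],
     "completedapps": [{"id": "...", "state": "FINISHED"}, ...]}
    """
    payload = response.json()
    for app in payload.get("completedapps", []):
        if app.get("id") == app_id:
            if app.get("state") == "FINISHED":
                return True
            # Ended in another state: raise so that the failure is not
            # mistaken for "still waiting".
            raise ValueError(
                f"application {app_id} ended in state {app.get('state')}"
            )
    return False  # still running or not yet listed; the sensor pokes again
```

Returning False keeps the sensor poking at its poke_interval, while raising makes an unsuccessful end state surface as a task failure that downstream logic or retries can react to.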

Kill Spark Job or terminate EMR Cluster if job takes longer than expected

I have a Spark job that periodically hangs, leaving my AWS EMR cluster in a state where an application is RUNNING but the cluster is really stuck. I know that if my job doesn't get stuck, it'll finish in 5 hours or less. If it's still running after that, it's a sign that the job is stuck. YARN and the Spark UI are still responsive; it's just that an executor gets stuck on a task.
Background: I'm using an ephemeral EMR cluster that performs only one step before terminating, so it's not a problem to kill it off if I notice this job is hanging.
What's the easiest way to kill the task, job, or cluster in this case? Ideally this would not involve setting up some extra service to monitor the job -- ideally there would be some kind of spark / yarn / emr setting I could use.
Note: I've tried using spark speculation to unblock the stuck spark job, but that doesn't help.
EMR has a Bootstrap Actions feature where you can run scripts that start up when initializing the cluster. I've used this feature along with a startup script that monitors how long the cluster has been online and terminates itself after a certain time.
I use a script based off this one for the bootstrap action. https://github.com/thomhopmans/themarketingtechnologist/blob/master/6_deploy_spark_cluster_on_aws/files/terminate_idle_cluster.sh
Basically, make a script that checks /proc/uptime to see how long the EC2 machine has been online; once the uptime surpasses your time limit, you can send a shutdown command to the cluster.
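The uptime check described above could look roughly like this in Python. It is a sketch under assumptions: the function names are made up, the 5 hour default comes from the question, and the shutdown command is an assumed choice; the linked script may do it differently.

```python
import subprocess


def uptime_seconds(proc_uptime_text):
    """Parse the first field of /proc/uptime (seconds since boot)."""
    return float(proc_uptime_text.split()[0])


def should_shut_down(uptime, limit_seconds):
    """True once the machine has been up longer than the limit."""
    return uptime > limit_seconds


def check_and_shut_down(limit_seconds=5 * 60 * 60):
    """Read the uptime and shut the machine down if it exceeds the limit."""
    with open("/proc/uptime") as handle:
        uptime = uptime_seconds(handle.read())
    if should_shut_down(uptime, limit_seconds):
        # Assumed command; requires sudo rights on the instance.
        subprocess.run(["sudo", "shutdown", "-h", "now"], check=True)
```

The bootstrap action would need to arrange for check_and_shut_down() to run periodically, for example from a background loop or a cron entry.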

Apache Nifi - Submitting Spark batch jobs through Apache Livy

I want to schedule my Spark batch jobs from NiFi. I can see there is an ExecuteSparkInteractive processor which submits Spark jobs to Livy, but it executes the code provided in a property or taken from the content of the incoming flow file. How should I schedule my Spark batch jobs from NiFi and also take different actions depending on whether the batch job fails or succeeds?
You could use ExecuteProcess to run a spark-submit command.
But what you seem to be looking for is not a dataflow management tool but a workflow manager. Two great examples of workflow managers are Apache Oozie and Apache Airflow.
If you still want to use NiFi to schedule Spark jobs, you can schedule a GenerateFlowFile processor (on the primary node, so that it won't be scheduled twice, unless you want that), connect it to the ExecuteProcess processor, and make that run the spark-submit command.
For a little more complex workflow, I've written an article about :)
Hope it will help.

How to tell if your spark job is waiting for resources

I have a PySpark job that I submit to a standalone Spark cluster. This is an auto-scaling cluster on EC2 boxes, so when jobs are submitted and not enough nodes are available, a few more boxes spin up and become available after a few minutes.
We have a @timeout decorator on the main part of the Spark job to time out and raise an error when it has exceeded a certain time threshold (put in place because of some jobs hanging). The issue is that sometimes a job has not actually started yet because it's waiting on resources, but the @timeout function is evaluated anyway and the job errors out as a result.
So I'm wondering if there's any way to tell from within the application itself, with code, whether the job is waiting for resources?
To find out the status of the application, you need to query the Spark History Server, which reports the current status of the job.
You can solve your problem as follows:
1. Get the application ID of your job through sc.applicationId.
2. Use this application ID with the Spark History Server REST API to get the status of the submitted job.
You can find the Spark History Server Rest APIs at link.
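One possible way to apply these steps is sketched below. It rests on an assumption that goes beyond the answer: an application that has only its driver registered in the REST API's executors list is treated as still waiting for resources. The function names and base_url are illustrative. The same /api/v1 REST API is also served by a running application's own UI, so base_url can point at either.

```python
import json
from urllib.request import urlopen


def executors_url(base_url, app_id):
    """URL of the executors endpoint in Spark's monitoring REST API."""
    return f"{base_url.rstrip('/')}/api/v1/applications/{app_id}/executors"


def has_worker_executors(executors):
    """True if any executor other than the driver is registered."""
    return any(executor.get("id") != "driver" for executor in executors)


def is_waiting_for_resources(base_url, app_id):
    """Fetch the executor list and report whether only the driver exists."""
    with urlopen(executors_url(base_url, app_id)) as response:
        executors = json.load(response)
    return not has_worker_executors(executors)
```

With app_id taken from sc.applicationId, the timeout logic could start its clock only after is_waiting_for_resources(...) first returns False.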
