I need to run a Structured Streaming Spark job on AWS EMR. As a resilience requirement, if the Spark job fails for some reason, we want it to be recreated in EMR. This is similar to task orchestration in ECS, which can restart a task when its health check fails. However, EMR is more of a compute engine than an orchestration system.
I am looking at big data workflow orchestration tools such as Airflow, but Airflow does not support cycles in a DAG. How can I implement something like the following?
step_adder (EmrAddStepsOperator) >> step_checker (EmrStepSensor) >> step_adder (EmrAddStepsOperator).
What is the suggested way to improve this kind of job-level resilience? Any comments are welcome!
Some of the resilience is already covered by Apache Spark itself (for jobs submitted with spark-submit), but when you want to interact with processes that are not within Spark, Airflow might be a solution. In your case, a Sensor can help detect whether a certain condition has happened. Based on that, you can decide how to proceed in the DAG. Here is a simple HttpSensor that polls a batch job to see whether it has finished successfully:
# Airflow 2.x import path (requires the HTTP provider);
# in Airflow 1.10 use airflow.sensors.http_sensor instead.
from airflow.providers.http.sensors.http import HttpSensor

wait_batch_to_finish = HttpSensor(
    http_conn_id='spark_web',
    task_id="wait_batch_to_finish",
    method="GET",
    headers={"Content-Type": "application/json"},
    endpoint="/json",
    # check_spark_status inspects the response for the application id pulled from XCom
    response_check=lambda response: check_spark_status(response, "{{ ti.xcom_pull('batch_intel_task')}}"),
    poke_interval=60,
    dag=dag,
)
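The check_spark_status helper is not shown in the snippet; below is a hypothetical sketch of what it might look like, assuming the sensor's endpoint is the Spark standalone master's /json page (which lists applications under "activeapps" and "completedapps") and that the XCom value is the application id. Adjust the keys to whatever your endpoint actually returns.

def check_spark_status(response, app_id):
    """Return True once the application has finished successfully."""
    data = response.json()
    # Finished applications: succeed if our app ended in the FINISHED state.
    for app in data.get("completedapps", []):
        if app.get("id") == app_id:
            return app.get("state") == "FINISHED"
    # Still running (or not visible yet): keep poking.
    return False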
I am a beginner with Apache Spark. I was trying to understand the concept of the DAG which Apache Spark creates when we apply transformations one after another, and which gets executed once an action is performed.
What I could make out is that in the event of a job failure, the DAG comes to the rescue. Since all the intermediate RDDs are stored in memory, Spark knows up to which step the job ran successfully and restarts the job from that point, instead of starting from the beginning.
Now I have several questions here:
Can the DAG make Spark resilient to node failures?
Is it the driver node that maintains the DAG?
Can there be multiple DAGs for a single execution?
What I could make out is that in the event of a job failure, the DAG comes to the rescue. Since all the intermediate RDDs are stored in memory, Spark knows up to which step the job ran successfully and restarts the job from that point, instead of starting from the beginning.
I think what you have said above, based on your understanding, is not fully correct.
The DAG comes to the rescue in the event of node failures, not job failures.
The Spark driver knows which worker node is working on which partition of data through the cluster manager. So, when the cluster manager learns that a specific node is dead, it assigns another node to start processing. Because of the DAG, the new worker node knows the tasks it has to work on, but it has to perform all the transformations from the start, since everything the failed node had in memory is gone. The DAG helps Spark be fault-tolerant because it allows recovery from node failures.
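As a small illustration of that lineage (a toy PySpark example, not from the question): each transformation only records what has to be done, and toDebugString() shows the dependency graph the driver would use to recompute lost partitions on another node.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Two lazy transformations: nothing is executed yet, only the lineage is recorded.
rdd = sc.parallelize(range(100)) \
        .map(lambda x: x * 2) \
        .filter(lambda x: x % 3 == 0)

# The lineage (the DAG for this RDD) is what allows recomputation after a node failure.
print(rdd.toDebugString().decode())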
Your Question 1:
Can the DAG make Spark resilient to node failures?
Yes, the DAG makes Spark fault-tolerant to node failures.
Question 2:
Is it the driver node that maintains the DAG?
Yes. When an action is called, the DAG is submitted to the DAG scheduler, which runs in the driver, where the job is split into stages of tasks.
Question 3:
Can there be multiple DAGs for a single execution?
No, you cannot have multiple DAGs for a single execution, because the DAG is the graph that represents the operations you perform.
We have Spark batch jobs and Spark streaming jobs in our Hadoop cluster.
We would like to schedule and manage them both on the same platform.
We came across Airflow, which fits our need for a "platform to author, schedule, and monitor workflows".
I just want to be able to stop and start the Spark streaming job; using Airflow's graphs and profiling is less of an issue.
My question is: besides losing some functionality (graphs, profiling), why shouldn't I use Airflow to run Spark streaming jobs?
I came across this question:
Can airflow be used to run a never ending task?
which says it's possible, but does not explain why you shouldn't.
@mMorozonv's answer looks good. You could have one DAG start the stream if it does not exist, and a second DAG as a health checker to track its progress. If the health check fails, you could trigger the first DAG again.
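A minimal sketch of that health-checker idea (the DAG ids, schedule, and the check itself are assumptions; import paths are for Airflow 2.x):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import ShortCircuitOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

def stream_is_down():
    # Replace with a real check, e.g. querying YARN or the Spark REST API.
    return True  # True lets the downstream trigger task run; False short-circuits it

with DAG("spark_stream_health_check", start_date=datetime(2021, 1, 1),
         schedule_interval="*/10 * * * *", catchup=False) as dag:

    check = ShortCircuitOperator(task_id="check_stream",
                                 python_callable=stream_is_down)

    # Re-trigger the DAG that (re)starts the stream when the check says it is down.
    restart = TriggerDagRunOperator(task_id="restart_stream",
                                    trigger_dag_id="spark_stream_start")

    check >> restart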
Alternatively, you can run the stream with a once trigger [1].
# Load your streaming DataFrame
sdf = spark.readStream.load(path="data/", format="json", schema=my_schema)

# Perform transformations and then write; trigger(once=True) processes everything
# that is new since the last run (tracked via the checkpoint) and then terminates.
sdf.writeStream.trigger(once=True).start(path="/out/path", format="parquet")
This gives you all the same benefits of Spark streaming, with the flexibility of batch processing.
You can simply point the stream at your data, and the job will detect all the new files since the last iteration (using checkpointing), run a streaming batch, and then terminate. You can set your Airflow DAG's schedule to whatever lag you'd like to process data at (every minute, hour, etc.).
I wouldn't recommend this for low-latency requirements, but it's very suitable for running every minute.
[1] https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
Using Airflow's branching functionality, we can have one DAG that does both the scheduling and the monitoring of our streaming job. The DAG does a status check of the application and, in case the application is not running, submits the streaming job. Otherwise the DAG run can simply finish, or you can add a sensor that checks the streaming job's status after some time, with alerts and whatever else you need.
There are two main problems:
1. Submitting the streaming application without waiting until it finishes; otherwise our operator will run until it reaches its execution_timeout. This can be solved by submitting the streaming job in cluster mode with the spark.yarn.submit.waitAppCompletion configuration parameter set to false (see the sketch below).
2. Checking the status of our streaming application. We can check it through YARN, for example with the command yarn application -list -appStates RUNNING. If our application is already among the running applications, we should not trigger the streaming job again. The only requirement is to give the streaming job a unique name.
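A rough sketch of such a branching DAG (the DAG id, application name, file name, and schedule are assumptions; import paths are for Airflow 2.x):

import subprocess
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator  # DummyOperator on older versions
from airflow.operators.python import BranchPythonOperator

APP_NAME = "my_unique_streaming_app"   # keep the name unique, as noted above

def choose_branch():
    # Ask YARN for the currently RUNNING applications and branch accordingly.
    out = subprocess.check_output(
        ["yarn", "application", "-list", "-appStates", "RUNNING"]).decode()
    return "already_running" if APP_NAME in out else "submit_stream"

with DAG("streaming_keepalive", start_date=datetime(2021, 1, 1),
         schedule_interval="*/15 * * * *", catchup=False) as dag:

    check_running = BranchPythonOperator(task_id="check_running",
                                         python_callable=choose_branch)

    # Cluster mode + waitAppCompletion=false makes spark-submit return immediately,
    # so the operator does not hang until execution_timeout.
    submit_stream = BashOperator(
        task_id="submit_stream",
        bash_command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "--conf spark.yarn.submit.waitAppCompletion=false "
            f"--name {APP_NAME} my_streaming_job.py"
        ),
    )

    already_running = EmptyOperator(task_id="already_running")

    check_running >> [submit_stream, already_running]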
There are no strict reasons why you shouldn't use Airflow to run a Spark Streaming job. In fact, you can monitor your process by periodically logging some metrics with
# "query" is the StreamingQuery handle returned by writeStream.start();
# LOG is an ordinary Python logger, e.g. logging.getLogger(__name__)
LOG.info(query.lastProgress)
LOG.info(query.status)
and see them in the task log.
I want to schedule my Spark batch jobs from NiFi. I can see there is an ExecuteSparkInteractive processor which submits Spark jobs to Livy, but it executes the code provided in the property or taken from the content of the incoming flow file. How should I schedule my Spark batch jobs from NiFi, and how can I take different actions depending on whether the batch job fails or succeeds?
You could use ExecuteProcess to run a spark-submit command.
But what you seem to be looking for is not a dataflow management tool, but a workflow manager. Two great examples of workflow managers are Apache Oozie and Apache Airflow.
If you still want to use NiFi to schedule Spark jobs, you can use the GenerateFlowFile processor as the scheduled trigger (run on the primary node only, so it won't be scheduled twice, unless you want that), connect it to the ExecuteProcess processor, and have that run the spark-submit command.
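For example, the relevant ExecuteProcess properties could look roughly like this (the script path and arguments are placeholders, not from the question):

Command: spark-submit
Command Arguments: --master yarn --deploy-mode cluster /path/to/my_batch_job.py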
For a slightly more complex workflow, I've written an article about it :)
Hope it helps.
I am running Spark jobs on an EC2 cluster, and I have a trigger that submits a job periodically. I do not want to submit a job if one is already running on the cluster. Is there any API that can give me this information?
Spark, and by extension Spark Streaming, offers an operational REST API at http://<host>:4040/api/v1.
Checking the status of the current application will give you the information you are looking for.
Check the documentation: https://spark.apache.org/docs/2.1.0/monitoring.html#rest-api
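A minimal sketch of such a check (the host, port, and "don't submit while anything is running" policy are assumptions; the response layout follows the monitoring REST API linked above):

import requests

def spark_job_running(host="localhost", port=4040):
    try:
        apps = requests.get(f"http://{host}:{port}/api/v1/applications",
                            timeout=5).json()
    except requests.exceptions.ConnectionError:
        # Nothing is listening on the UI port, so no driver is running there.
        return False
    # Each application has one or more attempts; "completed" is False while it runs.
    return any(not attempt.get("completed", True)
               for app in apps for attempt in app.get("attempts", []))

if not spark_job_running():
    print("Safe to submit the next job")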
You can also consult the UI to see the status. For example, if you run locally, take a look at localhost:4040.
I had pretty big expectations of Spark Job Server, but found out it critically lacks documentation.
Could you please answer one or all of the following questions:
Does Spark Job Server submit jobs through a SparkSession?
Is it possible to run a few jobs in parallel with Spark Job Server? I have seen people run into trouble with this, but I haven't seen a solution yet.
Is it possible to run a few jobs in parallel with different CPU, core, and executor configs?
Spark Job Server does not support SparkSession yet. We will be working on it.
You can either create multiple contexts, or run a single context with the FAIR scheduler.
Use different contexts with different resource configurations.
Basically, the job server is just a REST API for creating Spark contexts, so you should be able to do whatever you could do with a Spark context.
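A small sketch against the job server's REST API (the host, port, context names, and the job's appName/classPath are placeholders; the query parameters follow the spark-jobserver documentation):

import requests

SJS = "http://localhost:8090"  # default spark-jobserver port

# Create two contexts with different resource configurations.
requests.post(f"{SJS}/contexts/small-ctx",
              params={"num-cpu-cores": "1", "memory-per-node": "512m"})
requests.post(f"{SJS}/contexts/big-ctx",
              params={"num-cpu-cores": "4", "memory-per-node": "4g"})

# Submit a job to whichever context fits; jobs in different contexts run in parallel.
requests.post(f"{SJS}/jobs",
              params={"appName": "my-app", "classPath": "com.example.MyJob",
                      "context": "big-ctx"})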