AWS EMR step in RUNNING state even when job has been completed - apache-spark

I am running, as an EMR step, a Spark job which partitions data on two columns. The job has spark.sql.sources.partitionOverwriteMode set to dynamic and uses SaveMode.Overwrite.
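The write is essentially the following (a simplified sketch; the bucket, paths, and partition column names are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partitioned-overwrite")
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

df = spark.read.parquet("s3://my-bucket/input/")  # placeholder input path

# Overwrite only the partitions present in df; col_a and col_b stand in
# for the two real partition columns.
(df.write
   .mode("overwrite")
   .partitionBy("col_a", "col_b")
   .parquet("s3://my-bucket/output/"))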
I can see in the Spark UI that the Spark job has finished execution, but the EMR step stays in the RUNNING state for more than an hour. I can also see the _SUCCESS file in the root output directory, with a timestamp in line with the Spark job's completion.
Any idea why the EMR step isn't completing, or any best practices for speeding up the process?

Related

Clear _spark_metadata directory from previous job every time a new streaming job is submitted

Let's say you want to replace an old Kafka Spark Streaming job running in AWS ECS. While a new task definition is being deployed, there will be two jobs pointing to the same _spark_metadata folder until the deployment finishes.
Is it required to always clear the _spark_metadata folder left over from the previous task execution?

Why is there just 1 job id in dataproc when there are multiple actions in the pyspark script?

The definition of a Spark job is:
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
So why is it that each spark-submit creates just one job ID in the Dataproc console?
Example: the following application should have 2 Spark jobs:
sc.parallelize(range(1000),10).collect()
sc.parallelize(range(1000),10).collect()
There is a difference between a Dataproc job and a Spark job. When you submit a script through the Dataproc API/CLI, it creates a single Dataproc job, which in turn calls spark-submit to hand the script to Spark. Inside Spark, however, the code above does create 2 Spark jobs, and you can see them in the Spark UI.
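If you want to confirm this outside the UI, the driver's status tracker also reports the Spark job IDs. A minimal sketch, assuming the two collect() calls above and no job group being set:

from pyspark import SparkContext

sc = SparkContext(appName="two-jobs-demo")

# Each collect() is an action, so each one spawns its own Spark job.
sc.parallelize(range(1000), 10).collect()
sc.parallelize(range(1000), 10).collect()

# The status tracker reflects the jobs the driver has run so far;
# after two actions it should report two job IDs.
print(sc.statusTracker().getJobIdsForGroup())

sc.stop()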

How does DAG make Apache Spark fault-tolerant?

I am a beginner with Apache Spark. I am trying to understand the concept of the DAG which Apache Spark creates when we apply transformations one after another, and which gets executed once an action is performed.
What I could make out is that in the event of a job failure, the DAG comes to the rescue. Since all the intermediate RDDs are stored in memory, Spark knows up to which step the job ran successfully and restarts the job from that point only, instead of starting from the beginning.
Now I have several questions here:
Can DAG make Spark resilient to node failures?
Is it the driver node which maintains the DAG?
Can there be multiple DAGs for a single execution?
What I could make out is that in the event of a job failure, the DAG comes to the rescue. Since all the intermediate RDDs are stored in memory, Spark knows up to which step the job ran successfully and restarts the job from that point only, instead of starting from the beginning.
I think what you have said above, based on your understanding, is not fully correct.
The DAG comes to the rescue in the event of node failures, not job failures.
The Spark driver knows, through the cluster manager, which worker node is working on which partition of data. So when the cluster manager learns that a specific node is dead, it assigns another node to start processing. Because of the DAG, the new worker node knows the tasks it has to work on, but it has to perform all the transformations from the start, since everything the failed node had in memory is also gone. The DAG helps Spark be fault-tolerant because it can recover from node failures.
Your Question 1:
Can DAG make Spark resilient to node failures?
Yes, the DAG makes Spark fault-tolerant to node failures.
Question 2:
Is it the driver node which maintains the DAG?
Yes. When an action is called, the created DAG is submitted to the DAGScheduler on the driver, where it gets converted into stages of the job.
Question 3:
Can there be multiple DAGs for a single execution?
No, you cannot have multiple DAGs, because the DAG is a graph that represents the operations you perform.
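If you want to see the lineage information the driver keeps for an RDD, toDebugString() prints its recursive dependencies. A minimal sketch (depending on the PySpark version the result is bytes or str, hence the decode):

from pyspark import SparkContext

sc = SparkContext(appName="lineage-demo")

# Two transformations build up a lineage; nothing runs until an action.
rdd = (sc.parallelize(range(1000), 10)
         .map(lambda x: x * 2)
         .filter(lambda x: x % 3 == 0))

# This dependency graph is what Spark uses to recompute partitions
# that were lost when a node failed.
lineage = rdd.toDebugString()
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)

sc.stop()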

spark structure streaming with efs is causing delay in job

I'm using Spark Structured Streaming 2.4.4 with Spark on Kubernetes. When I enable local checkpointing in a /tmp/ folder, jobs finish in 7-8 seconds. If EFS is mounted and the checkpoint location is placed on it, jobs take more than 5 minutes and are quite unstable.
Please find the screenshot from the Spark SQL tab.
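For reference, the write looks roughly like this (a simplified sketch; the Kafka broker, topic, and mount paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("efs-checkpoint-demo").getOrCreate()

# Kafka source; requires the spark-sql-kafka package on the classpath.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Fast case: checkpoint on local disk, e.g. option("checkpointLocation", "/tmp/checkpoints").
# Slow case: checkpoint on the mounted EFS volume, as below.
query = (stream.writeStream
         .format("parquet")
         .option("path", "/mnt/efs/output")
         .option("checkpointLocation", "/mnt/efs/checkpoints")
         .start())

query.awaitTermination()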

How to debug why pending stage is in the unknown state?

I have a Spark batch job which reads some JSON files, writes them to Hive, then queries some other Hive tables, does the computation, and writes the output back to Hive in ORC format.
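Roughly, the pipeline looks like this (a simplified sketch; the paths, table names, and query are placeholders, not the real job):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("batch-pipeline-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Step 1: read the JSON files and land them in Hive.
raw = spark.read.json("/data/incoming/*.json")
raw.write.mode("overwrite").saveAsTable("staging.raw_events")

# Step 2: join against other Hive tables and do the computation.
result = spark.sql("""
    SELECT e.id, d.category, COUNT(*) AS cnt
    FROM staging.raw_events e
    JOIN reference.dimensions d ON e.dim_id = d.id
    GROUP BY e.id, d.category
""")

# Step 3: write the output back to Hive in ORC format.
result.write.mode("overwrite").format("orc").saveAsTable("analytics.event_counts")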
What I experience is that the job gets stuck with one stage in the pending state.
The DAG looks as follows:
I'm using Hadoop 2.7.3.2.6.5.0-292 and Spark is running on YARN.
I looked at the YARN logs and the Spark event logs, but do not see an issue.
Just rerunning the job results in the same behavior.
The question is: what does the unknown state for a stage mean, and how can I debug why the job is stuck in it?
