I have a large Spark SQL query (Spark 2.4) that joins two Hive tables and then aggregates them. One table is more than 1 TB and the other is 500 GB+.
On the Spark UI, I see Stage ID = 2 under Completed Stages, but Spark keeps adding new retry attempts of that stage, each time with a different number of tasks. What exactly is happening here? Can anyone point me to documentation on how a completed stage can be retried by Spark?
I am new to Spark. I have a couple of questions regarding the Spark Web UI:
I have seen that Spark can create multiple Jobs for the same application. On what basis does it create the Jobs?
I understand Spark creates multiple Stages for a single Job around shuffle boundaries, and that there is one task per partition. However, I have seen a particular Stage (e.g. Stage 1) of a particular Job create fewer tasks than the default shuffle partitions value (e.g. only 2/2 completed), and I have also seen the next Stage (Stage 2) of the same Job create 1500 tasks (e.g. 1500/1500 completed), which is more than the default shuffle partitions value.
So, how does Spark determine how many tasks it should create for any particular Stage?
Can anyone please help me understand the above.
The maximum number of tasks running at any one moment depends on your cores and number of executors; different stages have different task counts.
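As a rough illustration (the numbers here are assumptions, not from the question), the cluster-wide task slots are approximately the number of executors times the cores per executor:

// Rough sketch with assumed values: task slots ≈ executors * cores per executor.
val numExecutors     = 10  // e.g. --num-executors 10 (assumed)
val coresPerExecutor = 4   // e.g. --executor-cores 4 (assumed)
val concurrentTasks  = numExecutors * coresPerExecutor
println(concurrentTasks)   // at most 40 tasks run at the same time; the rest wait in the scheduler queue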
Do tasks in the same stage run simultaneously? If so, what do the lines between partitions in a stage refer to? (See the example DAG.)
Here is a good link for your reading that explains the DAG in detail, along with a few other things that may be of interest: the Databricks blog on DAGs.
I can try to explain. As each stage is created, it gets a set of tasks that are divided among the executors. When an action is encountered, the driver sends the tasks to the executors; based on how your data is partitioned, N tasks are invoked on the data in your distributed cluster. The arrows you are seeing are the execution plan: for example, Spark cannot run a map function before reading the file. Each node that holds some of the data executes its tasks in the order given by the DAG.
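To make that concrete, here is a minimal sketch (the app name and file path are assumptions): nothing runs until the action at the end, and the stage then launches one task per partition on the executors.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dag-example").getOrCreate()   // assumed app name
val lines   = spark.sparkContext.textFile("hdfs:///tmp/input.txt")        // assumed path
val lengths = lines.map(_.length)         // narrow transformation: stays in the same stage as the read
println(lengths.getNumPartitions)         // the number of tasks that stage will launch
println(lengths.reduce(_ + _))            // action: only now does the driver ship tasks to the executors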
So I understand that a stage is a set of tasks that run on the same node. Then why do I get two stages when I run in local mode?
A stage is a set of parallel tasks - one task per partition.
The number of stages is determined by the number of shuffle/wide transformations.
So coming back to your case: if you have a shuffle operation, it will result in two stages.
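For example, a minimal sketch (table and column names are assumptions): the groupBy introduces a shuffle, so the job splits into one stage that scans the input and a second stage that aggregates the shuffled data.

import org.apache.spark.sql.SparkSession

val spark  = SparkSession.builder().appName("two-stages").enableHiveSupport().getOrCreate()
val df     = spark.table("db.events")            // assumed Hive table: the first stage scans it
val counts = df.groupBy("user_id").count()       // wide transformation: shuffle boundary
counts.show()                                    // action: the second stage runs on the shuffled data

The task count of the second stage equals spark.sql.shuffle.partitions (200 by default), while the scan stage's task count depends on the partitions/splits of the source data — which is also why the two stages in your UI show different task numbers.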
I have a job consisting of around 9 SQL statements that pull data from Hive and write back to a Hive DB. It currently runs for 3 hours, which seems too long considering Spark's ability to process data. The application launches 11 stages in total.
I did some analysis using the Spark UI and found the grey areas below, which could be improved:
Stage 8 in Job 5 has shuffle output of 1.5 TB.
The time gap between Job 4 and Job 5 is 20 minutes. I read about this gap and found that Spark performs I/O outside of a Spark job, which shows up as a gap between two jobs and can be seen in the driver logs.
We have a cluster of 800 nodes with restricted resources for each queue, and I am using the conf below to submit the job:
--num-executors 200
--executor-cores 1
--executor-memory 6G
--deploy-mode client
Attaching an image of the UI as well.
Now my questions are:
Where can I find the driver log for this job?
In the image, I see a long list of "Executor added" events that sum to more than 200, but in the Executors tab the number is exactly 200. Any explanation for this?
Out of all the stages, only one stage has around 35,000 tasks; the rest have only 200 tasks. Should I increase the number of executors, or should I go for Spark's dynamic allocation facility?
Below are some thought processes that may guide you to some extent (a configuration sketch follows the list):
Is it necessary to have one core per executor? An executor need not always be slim; you can have more cores in one executor. It is a trade-off between creating slim vs. fat executors.
Configure the shuffle partition parameter spark.sql.shuffle.partitions.
Ensure that while reading data from Hive you are using a SparkSession (essentially a HiveContext). This will pull the data into Spark memory from HDFS and the schema information from the Hive metastore.
Yes, dynamic allocation of resources is a feature that helps allocate the right set of resources, and it is better than a fixed allocation.
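A minimal sketch of those suggestions (the app name, query, and values are assumptions and need tuning for your queue): an explicit shuffle-partition count, Hive support on the SparkSession, and dynamic allocation instead of a fixed executor count.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-etl")                                   // assumed app name
  .enableHiveSupport()                                   // schema comes from the Hive metastore
  .config("spark.sql.shuffle.partitions", "1000")        // assumed value; the default is 200
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "50")  // assumed bounds
  .config("spark.dynamicAllocation.maxExecutors", "200")
  .config("spark.shuffle.service.enabled", "true")       // required for dynamic allocation on YARN
  .getOrCreate()

val df = spark.sql("SELECT * FROM db.some_table")        // assumed query; runs through the Hive metastore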
I have a job that takes a HiveQL query joining 2 tables (2.5 TB and 45 GB), repartitions the result to 100 partitions, and then does some other transformations. This executed fine earlier.
Job stages:
Stage 0: Hive table 1 scan
Stage 1: Hive table 2 scan
Stage 2: Tungsten exchange for the join
Stage 3: Tungsten exchange for the repartition
Today the job is stuck in Stage 2. Out of the 200 tasks that are supposed to execute, none have started, but 290 have failed due to preempted executors.
On drilling down into the stage it says "no metrics reported by executors", yet under the Executors tab I can see 40 executors with active tasks. Also, when Stage 2 starts, the shuffle read increases gradually, stops at 45 GB, and after that I don't see any progress.
Any inputs on how to resolve this issue? I'll try reducing the executor memory to see if resource allocation is the issue.
Thanks.
It turns out it was a huge dataset and the join was being re-evaluated during this stage. The tasks were running for a long time while re-reading the datasets. I persisted the joined dataset to make it progress faster.
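For anyone hitting the same thing, a minimal sketch of that fix (the table names, join key, and storage level are assumptions): persist the joined data so the later stages reuse it instead of re-evaluating the join.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("join-persist").enableHiveSupport().getOrCreate()
val big   = spark.table("db.big_table")       // the ~2.5 TB table (assumed name)
val small = spark.table("db.small_table")     // the ~45 GB table (assumed name)

val joined = big.join(small, Seq("key"))      // assumed join key
  .repartition(100)
  .persist(StorageLevel.MEMORY_AND_DISK)      // keep the joined data so downstream stages do not redo the join

joined.count()                                // materialize the cache once
// subsequent transformations read the persisted result instead of re-scanning both tables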