Spark job getting stuck at 99% and doesn't continue [duplicate] - apache-spark

This question already has an answer here:
Spark final task takes 100x times longer than first 199, how to improve
(1 answer)
Closed 5 years ago.
I have a basic Spark job that does a couple of joins. The three data frames being joined are fairly big, with nearly 2 billion records in each. I have Spark infrastructure that automatically scales up nodes whenever necessary. It is a very simple Spark SQL query whose result I write to disk, but the job always gets stuck at 99% when I look at it in the Spark UI.
A bunch of things I have tried:
Increase the number of executors and the executor memory.
Use repartition while writing the file.
Use the native Spark join instead of the Spark SQL join, etc.
However, none of these have worked. It would be great if somebody could share their experience of solving this problem. Thanks in advance.

Because of the join operations, all records with the same key are shuffled to the same executor. If your data is skewed, meaning that one or a few keys dominate in terms of the number of rows, then a single executor has to process all of those rows. Essentially your Spark job becomes single threaded, since that one key has to be processed by a single task.
Repartitioning will not help, since the join operation will shuffle the data again by hashing the join key. You could try increasing the number of partitions in case of an unlucky hash.
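A quick way to confirm this before reworking the join is to count rows per join key and look at the heaviest ones. A minimal sketch, assuming a DataFrame df and a join column named join_key (both placeholders):

import org.apache.spark.sql.functions._

// Count rows per key and show the heaviest keys; a handful of keys
// dominating the counts confirms the skew described above.
df.groupBy("join_key")
  .count()
  .orderBy(desc("count"))
  .show(20, truncate = false)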
This video explains the problems, and suggests a solution:
https://www.youtube.com/watch?v=6zg7NTw-kTQ
Cheers, Fokko

Related

Why can coalesce lead to too few nodes for processing?

I am trying to understand Spark partitions, and in a blog I came across this passage:
However, you should understand that you can drastically reduce the parallelism of your data processing — coalesce is often pushed up further in the chain of transformation and can lead to fewer nodes for your processing than you would like. To avoid this, you can pass shuffle = true. This will add a shuffle step, but it also means that the reshuffled partitions will be using full cluster resources if possible.
I understand that coalesce means taking the data from the executors holding the least data and moving it to already existing executors via a hash partitioner. I am not able to understand what the author is trying to say in this paragraph, though. Can somebody please explain what is being said here?
Coalesce has some not-so-obvious effects due to Spark Catalyst.
E.g. let's say you had a parallelism of 1000, but you only wanted to write 10 files at the end. You might think you could do:
load().map(…).filter(…).coalesce(10).save()
However, Spark will effectively push the coalesce operation down to as early a point as possible, so this will execute as:
load().coalesce(10).map(…).filter(…).save()
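One common way around this is to use repartition instead of coalesce, since repartition always adds a shuffle boundary and therefore keeps the upstream stages at full parallelism. A minimal sketch, with hypothetical paths and a placeholder filter:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.getOrCreate()

// repartition(10), unlike coalesce(10), is not pulled up the chain: the read
// and the filter still run with full parallelism, and only 10 files are written.
spark.read.parquet("/path/to/input")
  .filter(col("value") > 0)
  .repartition(10)
  .write.parquet("/path/to/output")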
You can read the details in an excellent article that I chanced upon some time ago and quote from here: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908
In summary: Catalyst's treatment of coalesce can reduce concurrency early in the pipeline. I think this is what is being alluded to, though of course each case is different, and JOINs and aggregations are generally not subject to such effects because of the default of 200 shuffle partitions that applies to those Spark operations.
As you said in your question, "coalesce means to take the data on some of the least data containing executors and shuffle them to already existing executors via a hash partitioner". This effectively means the following:
The number of partitions is reduced.
The main difference between repartition and coalesce is that coalesce moves less data across partitions than repartition, which reduces the amount of shuffle and is therefore more efficient.
Adding shuffle=true simply distributes the data evenly across the nodes, which is the same as using repartition(). You can use shuffle=true if you feel that your data might become skewed across the nodes after performing a coalesce.
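For the RDD API this is literally how the two relate; a minimal sketch, assuming an existing RDD named rdd:

// coalesce with the default shuffle = false only merges partitions locally,
// while shuffle = true redistributes the data, which is what repartition(10)
// does under the hood.
val merged     = rdd.coalesce(10)                  // narrow, no shuffle
val rebalanced = rdd.coalesce(10, shuffle = true)  // equivalent to rdd.repartition(10)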
Hope this answers your question

Spark map-side aggregation: Per partition only?

I have been reading about map-side reduce/aggregation, and there is one thing I can't seem to understand clearly. Does it happen per partition only, or is it broader in scope? I mean, does it also reduce across partitions if the same key appears in multiple partitions processed by the same Executor?
Now I have a few more questions depending on whether the answer is "per partition only" or not.
Assuming it's per partition:
What are good ways to deal with a situation where I know my dataset lends itself well to reducing further across local partitions before a shuffle. E.g. I process 10 partitions per Executor and I know they all include many overlapping keys, so it could potentially be reduced to just 1/10th. Basically I'm looking for a local reduce() (like so many). Coalesce()ing them comes to mind, any common methods to deal with this?
Assuming it reduces across partitions:
Does it happen per Executor? How about Executors assigned to the same Worker node, do they have the ability to reduce across each other's partitions, recognizing that they are co-located?
Does it happen per core (Thread) within the Executor? The reason I'm asking is that some of the diagrams I looked at seem to show a Mapper per core/Thread of the executor, and it looks like the results of all tasks coming out of that core go to a single Mapper instance (which does the shuffle writes, if I am not mistaken).
Is it deterministic? E.g. if I have a record, let's say A=1 in 10 partitions processed by the same Executor, can I expect to see A=10 for the task reading the shuffle output? Or is it best-effort, e.g. it still reduces but there are some constraints (buffer size etc.) so the shuffle read may encounter A=4 and A=6.
Map-side aggregation is similar to the Hadoop combiner approach. Reducing locally makes sense for Spark as well and means less shuffling. So it works per partition, as you state.
When applying reducing functionality, e.g. a groupBy & sum, shuffling occurs first so that the keys end up in the same partition, and then the above can occur (with dataframes this happens automatically). A simple count, say, will also reduce locally, and the overall count will then be computed by bringing the intermediate results back to the driver.
So results are combined on the Driver from the Executors, depending on what is actually requested, e.g. a collect or the printing of a count. But if you write out after an aggregation of some nature, then the reducing is limited to the Executors on the Workers.
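A minimal sketch of where this combiner-style, per-partition aggregation shows up in the RDD API, assuming a hypothetical pair RDD called pairs of (key, count) records:

// reduceByKey uses a map-side combine: each task first merges values for
// duplicate keys inside its own partition, then shuffles only the partial sums.
// groupByKey, by contrast, shuffles every record as-is.
val combinedFirst     = pairs.reduceByKey(_ + _)
val shuffleEverything = pairs.groupByKey().mapValues(_.sum)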

Iterative Broadcast Join in Spark SQL

I recently came across a talk about dealing with skew in Spark SQL by using "Iterative" Broadcast Joins to improve query performance when joining a large table with another not-so-small table. The talk advises tackling such scenarios with "Iterative Broadcast Joins", but unfortunately it doesn't probe deep enough for me to understand the implementation.
Hence, I was hoping someone could shed some light on how to implement this Iterative Broadcast Join in Spark SQL, with a few examples. How do I implement it using Spark SQL queries with the SQL API?
Note: I am using Spark SQL 2.4.
Any help is appreciated. Thanks.
Iterative Broadcast Join: when the smaller (but not that small) table is still too large to broadcast in full, it might be worth considering the approach of iteratively taking slices of it, broadcasting those, joining each slice with the larger table, and then unioning the results.
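A minimal sketch of that idea with the DataFrame API, assuming hypothetical DataFrames large and medium joined on a column join_key, and an arbitrary slice count of 4:

import org.apache.spark.sql.functions._

// Split the medium table into slices that are each small enough to broadcast,
// join every slice with the large table, and union the per-slice results.
val numSlices = 4
val sliced = medium.withColumn("slice", pmod(hash(col("join_key")), lit(numSlices)))

val joined = (0 until numSlices)
  .map { i =>
    val slice = sliced.filter(col("slice") === i).drop("slice")
    large.join(broadcast(slice), Seq("join_key"))
  }
  .reduce(_ union _)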
Another concept that helps with this kind of skew is:
i) Salting technique: here we add a random number to the key so that the data gets evenly distributed across the cluster. Let's see this through an example.
Suppose we are performing a join between a large and a small table, the data gets divided across three executors x, y and z, and the results are unioned later. Since we have data skew, all of key X ends up on one executor, Y on another, and Z on another.
Since the Y and Z data is relatively small, those executors finish quickly and then wait for the X executor to complete, which takes time.
So to improve performance we should get the X executor's data evenly distributed across all executors.
Since the data is stuck on one executor, we add a random number to every key (on both the large and the small table) and run the process again.
Adding a random number: key = explode(key, range(1, 3)), which will give key_1, key_2, key_3.
Now the data is evenly distributed across executors, which gives faster performance.
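A minimal sketch of this salting idea with the DataFrame API, assuming hypothetical DataFrames largeDf and smallDf joined on a column key, and a salt range of 3 to mirror the range(1, 3) notation above:

import org.apache.spark.sql.functions._

// Give every row of the large table a random salt, replicate the small table
// once per salt value, and join on (key, salt) so the hot key is spread
// across several partitions instead of landing on a single executor.
val saltRange = 3
val saltedLarge = largeDf.withColumn("salt", (rand() * saltRange).cast("int"))
val saltedSmall = smallDf.withColumn("salt", explode(array((0 until saltRange).map(lit): _*)))

val joined = saltedLarge.join(saltedSmall, Seq("key", "salt")).drop("salt")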
If you need more help, please see this video:
https://www.youtube.com/watch?v=d41_X78ojCg&ab_channel=TechIsland
and this link:
https://dzone.com/articles/improving-the-performance-of-your-spark-job-on-ske#:~:text=Iterative%20(Chunked)%20Broadcast%20Join,table%2C%20then%20unioning%20the%20result.

Why one RDD count job takes so much time

I used the newAPIHadoopRDD() method to load HBase records into an RDD and run a simple count job.
However, this count job takes far more time than I can imagine. Looking at the code, I am thinking that maybe one column family in HBase simply has too much data, and when I load the records into the RDD, that much data may cause the executors to run out of memory.
Is it possible that this is the cause of the issue?
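For reference, a minimal sketch of the kind of load-and-count job described here, assuming an existing SparkContext sc and a placeholder table name:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// newAPIHadoopRDD scans the whole table, so even a "simple" count() has to
// pull every HBase record through the executors; a very wide column family
// therefore makes the job much heavier than the operation suggests.
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name

val records = sc.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(records.count())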

How is task distributed in spark

I am trying to understand how, when a job is submitted with spark-submit and I have Spark deployed on a system with 4 nodes, the work is distributed in Spark. If there is a large data set to operate on, I want to understand exactly how many stages the work is divided into and how many executors run for the job. I want to understand how this is decided for every stage.
It's hard to answer this question exactly, because there are many uncertainties.
The number of stages depends only on the described workflow, which includes different kinds of maps, reduces, joins, etc. If you understand it, you can basically read it right off the code. More importantly, that helps you write more performant algorithms, because it's generally known that one has to avoid shuffles. For example, when you do a join it requires a shuffle, which is a stage boundary. This is pretty simple to see: print rdd.toDebugString and look at the indentation, because an indentation marks a shuffle.
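A tiny sketch of reading the stage boundaries off a lineage, assuming two hypothetical pair RDDs pairs and otherPairs:

// The join below forces a shuffle; in the printed lineage each extra level
// of indentation marks such a shuffle, i.e. a stage boundary.
val joined = pairs.join(otherPairs)
println(joined.toDebugString)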
The number of executors is a completely different story, because it depends on the number of partitions. With 2 partitions you need only 2 executors, but with 40 you need all 4, since you only have 4. Additionally, the number of partitions may depend on a few properties you can provide at spark-submit:
spark.default.parallelism parameter or
the data source you use (e.g. it is different for HDFS and Cassandra)
It'd be good to keep all of the cores in the cluster busy, but no more than that (meaning each core processes just one partition at a time), because processing each partition carries a bit of overhead. On the other hand, if your data is skewed, then some cores will need more time than others to process the bigger partitions; in that case it helps to split the data into more partitions so that all cores are busy for roughly the same amount of time. This helps with balancing the cluster and throughput at the same time.
