Spark task duration difference - apache-spark

I'm running application that loads data (.csv) from s3 into DataFrames, and than register those Dataframes as temp tables. After that, I use SparkSQL to join those tables and finally write result into db. Issue that is currently bottleneck for me is that I feel tasks are not evenly split and i get no benefits or parallelization and multiple nodes inside cluster. More precisely, this is distribution of task duration in problematic stage
task duration distribution
Is there way for me to enforce more balanced distribution ? Maybe manually writing map/reduce functions ?
Unfortunately, this stage has 6 more tasks that are still running (1.7 hours atm), which will prove even greater deviation.

There are two likely possibilities: one is under your control and .. unfortunately one is likely not ..
Skewed data. Check that the partitions are of relatively similar size - say within a factor of three or four.
Inherent variability of Spark tasks runtime. I have seen behavior of large delays in stragglers on Spark Standalone, Yarn, and Mesos without an apparent reason. The symptoms are:
extended periods (minutes) where little or no cpu or disk activity were occurring on the nodes hosting the straggler tasks
no apparent correlation of data size to the stragglers
different nodes/workers may experience the delays on subsequent runs of the same job
One thing to check: do hdfs dfsadmin -report and hdfs fsck to see if hdfs were healthy.

Related

Apache Spark: is it possible to get dataset counts in a spark job?

Sometimes some Spark job which runs in our cluster runs too long not because of bad optimization, but because of bad logic of the algorithm. In most cases this is a consequence of some unnecessary joins that produce too many rows. Normally we spot such jobs by looking at Spark execution plan where we can find such joins by looking at "number of output rows: xxx" in blue stage labels.
I want to understand - is it possible to optimize this procedure and somehow automatically notify the programmer that the job has too many rows in some dataset (after execution)?
Maybe we can print this in logs (without manually counting dataset's size in code)?
Maybe after running the job we can get the output of the execution plan somehow and save it for further investigations?
No, it's not an option. Spark will do its best to optimize the query plan, so manual interaction with lower execution level is pretty much limited. However, you can "control" the rows for each jobs/tasks by changing some configurations (like spark.sql.shuffle.partitions or spark.sql.files.maxPartitionBytes), or by repartitioning data, which will cause data to be shuffled and re-distributed nearly equally between executors.

How do you determine shuffle partitions for Spark application?

I am new to spark so am following this amazing tutorial from sparkbyexamples.com and while reading I found this section:
Shuffle partition size & Performance
Based on your dataset size, a number of cores and memory PySpark
shuffling can benefit or harm your jobs. When you dealing with less
amount of data, you should typically reduce the shuffle partitions
otherwise you will end up with many partitioned files with less number
of records in each partition. which results in running many tasks with
lesser data to process.
On other hand, when you have too much of data and having less number
of partitions results in fewer longer running tasks and some times you
may also get out of memory error.
Getting the right size of the shuffle partition is always tricky and
takes many runs with different values to achieve the optimized number.
This is one of the key properties to look for when you have
performance issues on PySpark jobs.
Can someone help me understand how do you determine how many shuffle partitions you will need for your job?
As you quoted, it’s tricky, but this is my strategy:
If you’re using “static allocation”, means you tell Spark how many executors you want to allocate for the job, then it’s easy, number of partitions could be executors * cores per executor * factor. factor = 1 means each executor will handle 1 job, factor = 2 means each executor will handle 2 jobs, and so on
If you’re using “dynamic allocation”, then it’s trickier. You can read the long description here https://databricks.com/blog/2021/03/17/advertising-fraud-detection-at-scale-at-t-mobile.html. The general idea is you need to answer many questions like what’s the size if your data (how big in terms of gigabytes), how its structure looks like (how many files, how many folders, how many rows etc), how would you read it (from hdfs or from hive or from jdbc), how much resources do you have (cores, executors, memory), … Then you run and benchmark over and over to find the sweet spot that is “just right” for your circumstances.
Update #1:
So what is the general industry practice, will a company simply use first tactic and allocate more hardware or they will use dynamic allocation?
Usually, if you have an on-premise Hadoop environment, you can choose between static (default mode) and dynamic allocation (advanced mode). Also, I often start with dynamic because I have no idea how big the data and its transformation is, so stick with dynamic give me flexibility to expand my work without thinking too much about Spark configuration. But you also can start with static if you want to, nothing preventing you to do so.
Then eventually, when it came to productionize process, you also can choose between static (very stable but consumes more resources) vs dynamic (less stable, i.e fail sometimes due to resources allocation, but save resources.
Finally, most Hadoop cloud solution (like Databricks) come with dynamic allocation by default, which is is less costly.

Memory Management Pyspark

1.) I understand that "Spark's operators spills data to disk if it does not fit memory allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the no. of executor cores increases parallelism. Would that also increase the chances of OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM because it performs operations in memory as compared to Hive, which repeatedly reads, writes into disk. Is that correct?
There is one angle that you need to consider there. You may get memory leaks if the data is not properly distributed. That means that you need to distribute your data evenly (if possible) on the Tasks so that you reduce shuffling as much as possible and make those Tasks to manage their own data. So if you need to perform a join, if data is distributed randomly, every Task (and therefore executor) will have to:
See what data they have
Send data to other executors (and tasks) to provide the same keys they need
Request the data that is needed by that task to the others
All that data exchange may cause network bottlenecks if you have a large dataset and also will make every Task to hold their data in memory plus whatever has been sent and temporary objects. All of those will blow up memory.
So to prevent that situation you can:
Load the data already repartitioned. By that I mean, if you are loading from a DB, try Spark stride as defined here. Please refer to the partitionColumn, lowerBound, upperBound attributes. That way you will create a number of partitions on the dataframe that will set the data on different tasks based on the criteria you need. If you are going to use a join of two dataframes, try similar approach on them so that partitions are similar (for not to say same) and that will prevent shuffling over network.
When you define partitions, try to make those values as evenly distributed among tasks as possible
The size of each partition should fit on memory. Although there could be spill to disk, that would slow down performance
If you don't have a column that make the data evenly distributed, try to create one that would have n number of different values, depending on the n number of tasks that you have
If you are reading from a csv, that would make it harder to create partitions, but still it's possible. You can either split the data (csv) on multiple files and create multiple dataframes (performing a union after they are loaded) or you can read that big csv and apply a repartition on the column you need. That will create shuffling as well, but it will be done once if you cache the dataframe already repartitioned
Reading from parquet it's possible that you may have multiple files but if they are not evenly distributed (because the previous process that generated didn't do it well) you may end up on OOM errors. To prevent that situation, you can load and apply repartition on the dataframe too
Or another trick valid for csv, parquet files, orc, etc. is to create a Hive table on top of that and run a query from Spark running a distribute by clause on the data, so that you can make Hive to redistribute, instead of Spark
To your question about Hive and Spark, I think you are right up to some point. Depending on the execute engine that Hive uses in your case (map/reduce, Tez, Hive on Spark, LLAP) you can have different behaviours. With map/reduce, as they are mostly disk operations, the chance to have a OOM is much lower than on Spark. Actually from Memory point of view, map/reduce is not that affected because of a skewed data distribution. But (IMHO) your goal should be to find always the best data distribution for the Spark job you are running and that will prevent that problem
Another consideration is if you are testing in a dev environment that doesn't have same data as in a prod environment. I suppose the data distribution should be similar although volumes may differ a lot (I am talking from experience ;)). In that case, when you assign Spark tuning parameters on the spark-submit command, they may be different in prod. So you need to invest some time on finding the best approach on dev and fine tune in prod
Huge majority of OOM in Spark are on the driver, not executors. This is usually a result of running .collect or similar actions on a dataset that won't fit in the driver memory.
Spark does a lot of work under the hood to parallelize the work, when using structured APIs (in contrast to RDDs) the chances of causing OOM on executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that will impact performance and cause lots of garbage collection to happen so you need to address it, however spark should be able to handle low memory without explicit exception.
Not really - as above, Spark should be able to recover from memory issues when using structured APIs, however it may need intervention if you see garbage collection and performance impact.

Processing Pipeline using Spark SQL- jobs, stages and DAG sizes

I have a processing pipeline that is built using Spark SQL. The objective is to read data from Hive in the first step and apply a series of functional operations (using Spark SQL) in order to achieve the functional output. Now, these operations are quite in number (more than 100), which means I am running around 50 to 60 spark sql queries in a single pipeline. While the application completes successfully without any issues, my focus area has shifted to optimizing the overall process. I have been able to speed up the executions using spark.sql.shuffle.partitions, changing the executor memory and reducing the size of the spark.memory.fraction from default 0.6 to 0.2. I got great benefits by doing all these changes and the over all execution time reduced from 20-25 mins to around 10 mins. Data volume is around 100k rows (source side).
The observations that I have from the Cluster are:
-The number of jobs triggered as apart of application id are 235.
-The total number of stages across all the jobs created are around 600.
-8 executors are used in a two node cluster (64 GB RAM in total with 10 cores).
-The resource manager UI of Yarn (for an application id) becomes very slow to retrieve the details of jobs/stages.
In one of the videos of Spark tuning, I heard that we should try to reduce the number of stages to a bare minimum, also DAG size should be smaller. What are the guidelines to do this. How to find the number of shuffles that are happening (my SQLs have many joins and group by clauses).
I would like to have suggestions on the above scenario of what possible things I can do in order to improvise the performance and handle the data skews in the SQL queries that are JOIN/GROUP_BY heavy.
Thanks

How is task distributed in spark

I am trying to understand that when a job is submitted from the spark-submit and I have spark deployed system with 4 nodes how is the work distributed in spark. If there is large data set to operate on, I wanted to understand exactly in how many stages are the task divided and how many executors run for the job. Wanted to understand how is this decided for every stage.
It's hard to answer this question exactly, because there are many uncertainties.
Number of stages depends only on described workflow, which includes different kind of maps, reduces, joins, etc. If you understand it, you basically can read that right from the code. But most importantly that helps you to write more performant algorithms, because it's generally known the one have to avoid shuffles. For example, when you do a join, it requires shuffle - it's a boundary stage. This is pretty simple to see, you have to print rdd.toDebugString() and then look at indentation (look here), because indentation is a shuffle.
But with number of executors that's completely different story, because it depends on number of partitions. It's like for 2 partitions it requires only 2 executors, but for 40 ones - all 4, since you have only 4. But additionally number of partitions might depend on few properties you can provide at the spark-submit:
spark.default.parallelism parameter or
data source you use (f.e. for HDFS and Cassandra it is different)
It'd be a good to keep all of the cores in cluster busy, but no more (meaning single process only just one partition), because processing of each partition takes a bit of overhead. On the other hand if your data is skewed, then some cores would require more time to process bigger partitions, than others - in this case it helps to split data to more partitions so that all cores are busy roughly same amount of time. This helps with balancing cluster and throughput at the same time.

Resources