Spark+yarn - scale memory with input size - apache-spark

I am running a spark on yarn cluster with pyspark. I have a dataset which requires loading several binary files per key, and then running some calculation that is difficult to decompose into parts - so it generally has to operate across all the data for a single key.
Currently, I set spark.executor.memory and spark.yarn.executor.memoryOverhead to "sane" values that work most of the time, however certain keys end up having a much larger amount of data than the average, and in these cases, the memory is insufficient and the executor ends up getting killed.
I currently do one of the following:
1) Run jobs with the default memory setting and just rerun when certain keys fail with more memory
2) If I know one of my keys has much more data, I can scale up the memory for the job as a whole, however this has the downside of drastically reducing the number of running containers I get / number of jobs running in parallel.
Ideally I would have a system where I could send off a job and have the memory in an executor scale with input size, however I know that's not spark's model. Are there any extra settings that can help me here or any tricks for dealing with this problem? Anything obvious I'm missing as a fix?

You can test the following approach: set executor memory and executor yarn overhead to your max values and add spark.executor.cores with number greater than 1 (start with 2). Additionally set spark.task.maxFailures to some big number (lets say 10).
Then on normal-sized keys spark will probably finish tasks as usual but some partitions with larger keys with fail. They will be added to retry stage and since number of partitions to retry will be much lower than initial partitions, spark will distribute them evenly to executors. If number of partitions will be lower or equal number of executors, every partition will have twice memory compared to initial execution and may succeed.
Let me know if it will work for you.

Related

How do I make my Spark job run faster using executors?

I know my code is free from antipatterns since I don't have any warnings in my Authoring code editor, so I know my code is doing PySpark operations that are distributed and scalable.
My current job has 2 executors assigned to it with 2 cores each, and it runs with task parallelism of 16 as seen on the Spark Details page.
How do I make this job run faster?
Your Executors are the pieces of Spark infrastructure assigned to 'execute' your work. As such, the more of these 'workers' you have, the more work you are able to do in parallel and the faster your job will be.
There's a limit to the amount your job will increase in speed however, and this is a function of the max number of tasks in your stages. Note: with AQE, your max number of tasks will increase as you increase your executor count, so you will notice the task counts increasing up to a certain point.
For instance, if my data scale is such that I only ever have a maximum of 8 tasks (let's assume AQE is controlling this), assigning an executor count to run more than 8 tasks will waste resources and won't increase your job speed (see above note that AQE may adjust your task counts as you add executors since it's detecting that more work can be run in parallel).
The job defaults in most Foundry environments are 2 executors with 2 cores each, and 1 core per task. This means your job is capable of running 4 cores at a time, which means 4 tasks.
This means if your max task counts per stage in your job is 4, you won't benefit from boosting your number of executors. If, however, you observe your stages have, for instance, 16 tasks, then you can choose to increase the number of executors in your job as such:
16 max tasks, 1 core per task. -> 16 cores needed.
2 cores per executor -> 8 executors max.
We could therefore jump this example job up to 8 executors for maximum performance.
For the original question, you would bump the number of executors to 8 for maximum performance, assuming AQE hasn't increased your task counts following.
When AQE re-examines your job and your new count of Executors, it will detect that more tasks can be run in parallel and will therefore increase your task counts to try to match the infrastructure. However, when it does this, you might end up with tasks that are smaller than you would like.
The way AQE decides how big to make the tasks (and therefore how many tasks it will run with) is based on the setting spark.sql.adaptive.advisoryPartitionSizeInBytes and the total number of cores available in your job. If you have more cores than would be worth parallelizing (i.e. the shuffle partitions are too small), then these small partitions will be coalesced into a fewer count which will mean you then have the same wasted executor problem without AQE.
AQE will do the best it can with the executor counts you've given it, so you may see it get faster and faster with more executors up to a point. At the point more executors doesn't mean a faster job, this is because your partition sizes are too small to be worth splitting into smaller tasks, and you've started wasting executors

How to tune Spark to avoid sort disk spill?

We have an algorithm that currently processes data in partition-by-partition manner foreachPartition. I realize this might not be the best way to process data in Spark but, in theory, we should be able to make it work.
We have an issue with Spark spilling data after sortWithinPartitions call with approx 45 GB of data in partition. Our executor has 250 GB memory defined. In theory, there is enough space to fit the data in the memory (unless Spark's overhead for sorting is huge). Yet we experience the spills. Is there a way accurately calculate how much memory per executor we'd need to make it work?
You need to decrease a number of tasks per executor or to increase memory.
The first option is just to decrease spark.executor.cores
The second option obviously is to increase spark.executor.memory
The third option is to increase spark.task.cpus because number of tasks per executor are spark.executor.cores / spark.task.cpus. For example you set spark.executor.cores=4 and spark.task.cpus=2. Number of tasks in that case would be 4/2=2. And if you don't set spark.task.cpus, by default it is 1, 4/1=4 tasks which would consume much more memory.
I prefer the third option because it keeps a balance between occupied memory and cores and allow to use more than one core per task.

How do you determine shuffle partitions for Spark application?

I am new to spark so am following this amazing tutorial from sparkbyexamples.com and while reading I found this section:
Shuffle partition size & Performance
Based on your dataset size, a number of cores and memory PySpark
shuffling can benefit or harm your jobs. When you dealing with less
amount of data, you should typically reduce the shuffle partitions
otherwise you will end up with many partitioned files with less number
of records in each partition. which results in running many tasks with
lesser data to process.
On other hand, when you have too much of data and having less number
of partitions results in fewer longer running tasks and some times you
may also get out of memory error.
Getting the right size of the shuffle partition is always tricky and
takes many runs with different values to achieve the optimized number.
This is one of the key properties to look for when you have
performance issues on PySpark jobs.
Can someone help me understand how do you determine how many shuffle partitions you will need for your job?
As you quoted, it’s tricky, but this is my strategy:
If you’re using “static allocation”, means you tell Spark how many executors you want to allocate for the job, then it’s easy, number of partitions could be executors * cores per executor * factor. factor = 1 means each executor will handle 1 job, factor = 2 means each executor will handle 2 jobs, and so on
If you’re using “dynamic allocation”, then it’s trickier. You can read the long description here https://databricks.com/blog/2021/03/17/advertising-fraud-detection-at-scale-at-t-mobile.html. The general idea is you need to answer many questions like what’s the size if your data (how big in terms of gigabytes), how its structure looks like (how many files, how many folders, how many rows etc), how would you read it (from hdfs or from hive or from jdbc), how much resources do you have (cores, executors, memory), … Then you run and benchmark over and over to find the sweet spot that is “just right” for your circumstances.
Update #1:
So what is the general industry practice, will a company simply use first tactic and allocate more hardware or they will use dynamic allocation?
Usually, if you have an on-premise Hadoop environment, you can choose between static (default mode) and dynamic allocation (advanced mode). Also, I often start with dynamic because I have no idea how big the data and its transformation is, so stick with dynamic give me flexibility to expand my work without thinking too much about Spark configuration. But you also can start with static if you want to, nothing preventing you to do so.
Then eventually, when it came to productionize process, you also can choose between static (very stable but consumes more resources) vs dynamic (less stable, i.e fail sometimes due to resources allocation, but save resources.
Finally, most Hadoop cloud solution (like Databricks) come with dynamic allocation by default, which is is less costly.

Can the number of executor cores be greater than the total number of spark tasks? [duplicate]

What happens when number of spark tasks be greater than the executor core? How is this scenario handled by Spark
Is this related to this question?
Anyway, you can check this Cloudera How-to. In "Tuning Resource Allocation" section, It's explained that a spark application can request executors by turning on the dynamic allocation property. It's also important to set cluster properties such as num-executors, executor-cores, executor-memory... so that spark requests fit into what your resource manager has available.
yes, this scenario can happen. In this case some of the cores will be idle. Scenarios where this can happen:
You call coalesce or repartition with a number of partitions < number of cores
you use the default number of spark.sql.shuffle.partitions (=200)
and you have more than 200 cores available. This will be an issue for
joins, sorting and aggregation. In this case you may want to increase spark.sql.shuffle.partitions
Note that even if you have enough tasks, some (or most) of them could be empty. This can happen if you have a large data skew or you do something like groupBy() or Window without a partitionBy. In this case empty partitions will be finished immediately, turning most of your cores idle
I think the question is a little off beam. It is unlikely what you ask. Why?
With a lot of data you will have many partitions and you may repartition.
Say you have 10,000 partitions which equates to 10,000 tasks.
An executor (core) will serve a partition effectively a task (1:1 mapping) and when finished move on to the next task, until all tasks finished in the stage and then next will start (if it is in plan / DAG).
It's more likely you will not have a cluster of 10,000 executor cores at most places (for your App), but there are sites that have that, that is true.
If you have more cores allocated than needed, then they remain idle and non-usable for others. But with dynamic resource allocation, executors can be relinquished. I have worked with YARN and Spark Standalone, how this is with K8 I am not sure.
Transformations alter what you need in terms of resources. E.g. an order by may result in less partitions and thus may contribute to idleness.

Spark SQL slow execution with resource idle

I have a Spark SQL that used to execute < 10 mins now running at 3 hours after a cluster migration and need to deep dive on what it's actually doing. I'm new to spark and please don't mind if I'm asking something unrelated.
Increased spark.executor.memory but no luck.
Env: Azure HDInsight Spark 2.4 on Azure Storage
SQL: Read and Join some data and finally write result to a Hive metastore.
The spark.sql script ends with below code:
.write.mode("overwrite").saveAsTable("default.mikemiketable")
Application Behavior:
Within the first 15 mins, it loads and complete most tasks (199/200); left only 1 executor process alive and continually to shuffle read / write data. Because now it only leave 1 executor, we need to wait 3 hours until this application finish.
Left only 1 executor alive
Not sure what's the executor doing:
From time to time, we can tell the shuffle read increased:
Therefore I increased the spark.executor.memory to 20g, but nothing changed. From Ambari and YARN I can tell the cluster has many resources left.
Release of almost all executor
Any guidance is greatly appreciated.
I would like to start with some observations for your case:
From the tasks list you can see that that Shuffle Spill (Disk) and Shuffle Spill (Memory) have both very high values. The max block size for each partition during the exchange of data should not exceed 2GB therefore you should be aware to keep the size of shuffled data as low as possible. As rule of thumb you need to remember that the size of each partition should be ~200-500MB. For instance if the total data is 100GB you need at least 250-500 partitions to keep the partition size within the mentioned limits.
The co-existence of two previous it also means that the executor memory was not sufficient and Spark was forced to spill data to the disk.
The duration of the tasks is too high. A normal task should lasts between 50-200ms.
Too many killed executors is another sign which shows that you are facing OOM problems.
Locality is RACK_LOCAL which is considered one of the lowest values you can achieve within a cluster. Briefly, that means that the task is being executed in a different node than the data is stored.
As solution I would try the next few things:
Increase the number of partitions by using repartition() or via Spark settings with spark.sql.shuffle.partitions to a number that meets the requirements above i.e 1000 or more.
Change the way you store the data and introduce partitioned data i.e day/month/year using partitionBy

Resources