Can reduced parallelism lead to no shuffle spill? - apache-spark

Consider an example:
I have a cluster with 5 nodes and each node has 64 cores with 244 GB memory.
I decide to run 3 executors on each node, with executor-cores set to 21 and executor memory of 80 GB, so that each executor can execute 21 tasks in parallel. Now consider 315 (63 * 5) partitions of data, of which 314 partitions are 3 GB each but one is 30 GB (due to data skew).
All of the executors that received only 3 GB partitions have 63 GB occupied (21 * 3 = 63, since each executor can run 21 tasks in parallel and each task takes 3 GB of memory space).
But the one executor that received the 30 GB partition will need 90 GB (20 * 3 + 30) of memory. So will this executor first execute the 20 tasks of 3 GB and then load the 30 GB task, or will it try to load all 21 tasks and find that for one task it has to spill to disk? If I set executor-cores to just 15, then the executor that receives the 30 GB partition will only need 14 * 3 + 30 = 72 GB and hence won't spill to disk.
So in this case will reduced parallelism lead to no shuffle spill?
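To make the arithmetic explicit, here is a tiny sketch of the calculation above (all figures are the hypothetical ones from this question):
# Back-of-envelope check of the peak executor memory demand described above
# (hypothetical figures taken from this question, not measured values).
def peak_demand_gb(cores_per_executor, normal_gb=3, skewed_gb=30):
    # one task slot holds the skewed partition, the rest hold normal ones
    return (cores_per_executor - 1) * normal_gb + skewed_gb

print(peak_demand_gb(21))  # 90 GB > 80 GB executor memory -> pressure / spill
print(peak_demand_gb(15))  # 72 GB < 80 GB executor memory -> fits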

Here are a few pointers:
Spark (Shuffle) Map Stage ==> the size of each input partition depends on the filesystem's block size. E.g. if data is read from HDFS, each partition will try to hold close to 128 MB of data, so for the input data the number of partitions is roughly total input size / 128 MB (strictly mebibytes, i.e. 128 MiB, the default block/split size).
Now, the scenario you are describing is for shuffled data on the reducer side (Result Stage).
Here the blocks processed by reducer tasks are called shuffle blocks, and by default Spark (for the SQL/DataFrame APIs) will launch 200 reducer tasks (spark.sql.shuffle.partitions = 200).
An important thing to remember: a single shuffle block cannot exceed 2 GB, so if you have too few partitions and one of them requires a remote fetch of a shuffle block > 2 GB, you will see an error like Size exceeds Integer.MAX_VALUE.
To mitigate that, within the default limits Spark employs many optimizations (compression, Tungsten sort shuffle, etc.), but as developers we can also repartition skewed data intelligently and tune the default parallelism.
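As an illustration of that last point, a minimal PySpark sketch of tuning the shuffle partition count and salting a skewed key (the input path, the column name key, and the salt factor are hypothetical, not from the original question):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# raise the number of reducer tasks from the default of 200
spark.conf.set("spark.sql.shuffle.partitions", "1000")

# hypothetical skewed DataFrame: salt the hot key so its rows are spread
# over many shuffle partitions instead of one oversized shuffle block
df = spark.read.parquet("/path/to/skewed/input")   # placeholder path
salted = df.withColumn("salt", (F.rand() * 32).cast("int"))
partial = salted.groupBy("key", "salt").count()    # first, partial aggregation
result = partial.groupBy("key").sum("count")       # then, combine the partials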

Related

Can spark manage partitions larger than the executor size?

Question:
Spark seems to be able to manage partitions that are bigger than the executor size. How does it do that?
What I have tried so far:
I picked a CSV with: size on disk - 12.3 GB, size in memory deserialized - 3.6 GB, size in memory serialized - 1964.9 MB. I got these sizes by caching the data in memory both deserialized and serialized; 12.3 GB is the size of the file on disk.
To check whether Spark can handle partitions larger than the executor size, I created a cluster with just one executor, with spark.executor.memory equal to 500m. I also set executor cores (spark.executor.cores) to 2, increased spark.sql.files.maxPartitionBytes to 13 GB, and switched off dynamic allocation and adaptive query execution for good measure. The entire session configuration is:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.dynamicAllocation.enabled", False) \
    .config("spark.executor.cores", "2") \
    .config("spark.executor.instances", "1") \
    .config("spark.executor.memory", "500m") \
    .config("spark.sql.adaptive.enabled", False) \
    .config("spark.sql.files.maxPartitionBytes", "13g") \
    .getOrCreate()
I read the CSV and checked the number of partitions it is read into with df.rdd.getNumPartitions(). Output = 2. This is confirmed later on as well by the number of tasks.
Then I ran df.persist(StorageLevel.DISK_ONLY); df.count() (with StorageLevel imported from pyspark).
Following are the observations I made:
No caching happens until the data for one batch of tasks (equal to the number of CPU cores, if you have set 1 CPU core per task) has been read in completely. I conclude this since no entry shows up in the Storage tab of the web UI.
Each partition here ends up being around 6 GB on disk, which should, at a minimum, be around 1964.9 MB / 2 (= size in memory serialized / 2) in memory, i.e. roughly 980 MB. There is no spill. The relevant snapshot of the web UI was taken when around 11 GB of the data had been read in: Input was almost 11 GB, and at that point there was nothing in the Storage tab.
Questions:
Given that the memory per executor is 300 MB (execution + storage) + 200 MB (user memory), how is Spark able to manage ~980 MB partitions, and two of them in parallel at that (one per core)?
The data read in does not show up in Storage, is not (and cannot be) held in the executor, and there is no spill either. Where exactly is that read-in data?
(Screenshots of the web UI after job completion and of the Executors tab were attached to the original post.)
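As a side note on the memory figures above, a rough back-of-the-envelope sketch of how a 500 MB heap is carved up under Spark's unified memory model, assuming the documented defaults (300 MB reserved memory, spark.memory.fraction = 0.6):
# Rough split of a 500 MB executor heap under Spark's unified memory model
# (assuming the defaults: 300 MB reserved memory, spark.memory.fraction = 0.6).
executor_memory_mb = 500
reserved_mb = 300
usable_mb = executor_memory_mb - reserved_mb      # 200 MB left after the reserve
unified_mb = usable_mb * 0.6                      # ~120 MB for execution + storage
user_mb = usable_mb - unified_mb                  # ~80 MB of "user" memory
print(unified_mb, user_mb)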

In a Spark cluster, does the parallelism of file reads and writes depend on the number of executors or the number of cores?

Let's say I have a 128 GB dataset and read it as a Spark DataFrame. I set the config as:
executor cores = 4
number of partitions = 1000
maxPartitionBytes = 128 MB
Going by the above information, the number of executors is 250.
How many files can be read/written in parallel in this cluster? Is it 250 or 1000?
I know that 1000 files will be written if there are 1000 partitions, but are those 1000 files written in parallel at the same time, or are they written 250 files at a time, four times consecutively?
Does read/write parallelism depend on the number of executors or on the number of cores?
In Spark, the number of partitions defines the level of concurrency that can be achieved. This means that with 1000 partitions, as in your case, you can process all 1000 partitions in parallel, provided you have 1000 executor cores.
Spark runs one task per partition, and each task is handled by one executor core, so the level of parallelism within a single executor is 4 (its core count).
Note: total parallelism = number of executors * number of cores assigned to tasks per executor. Typically one core per node is reserved for the system rather than for task execution.
You can check out this article
https://www.sparkcodehub.com/spark-partitioning-shuffle
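As an illustration of the point about partitions versus task slots, a small PySpark sketch (the paths are placeholders; 1000 partitions and the 250 * 4 = 1000 task slots are the question's numbers):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# one output file per partition: 1000 partitions -> 1000 files, but only as
# many files are being written at any instant as there are free task slots
# (250 executors * 4 cores = 1000 slots in the question's setup)
df = spark.read.parquet("/path/to/128gb/input")    # placeholder path
df.repartition(1000).write.mode("overwrite").parquet("/path/to/output")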

Spark partition size greater than the executor memory

I have four questions. Suppose in Spark I have 3 worker nodes. Each worker node has 3 executors and each executor has 3 cores. Each executor has 5 GB memory. (Total 9 executors, 27 cores and 45 GB memory.) What will happen if:
I have 30 data partitions. Each partition is of size 6 GB. Optimally, the number of partitions must be equal to the number of cores, since each core executes one partition/task (one task per partition). Now in this case, how will each executor core process its partition, since the partition size is greater than the available executor memory? Note: I'm not calling cache() or persist(); I'm simply applying some narrow transformations like map() and filter() on my RDD.
Will Spark automatically try to store the partitions on disk? (I'm not calling cache() or persist(); only transformations are happening, followed by an action.)
Since I have more partitions (30) than available cores (27), my cluster can process at most 27 partitions at a time. What will happen to the remaining 3 partitions? Will they wait for the occupied cores to be freed?
If I call persist() with storage level MEMORY_AND_DISK and the partition size is greater than memory, will it spill data to disk? On which disk will this data be stored? The worker node's external HDD?
I'll answer each part as best I know it, possibly disregarding a few of your assertions:
I have four questions. Suppose in Spark I have 3 worker nodes. Each worker node has 3 executors and each executor has 3 cores. Each executor has 5 GB memory. (Total 9 executors, 27 cores and 45 GB memory.) What will happen if:
>>> I would use 1 Executor, 1 Core. That is the generally accepted paradigm afaik.
I have 30 data partitions. Each partition is of size 6 GB. Optimally, the number of partitions must be equal to the number of cores, since each core executes one partition/task (one task per partition). Now in this case, how will each executor core process its partition, since the partition size is greater than the available executor memory? Note: I'm not calling cache() or persist(); I'm simply applying some narrow transformations like map() and filter() on my RDD.
>>> The number of partitions being the same as the number of cores is not true. You can service 1000 partitions with 10 cores, processing one at a time per core. What if you have 100K partitions and are on-prem? It is unlikely you will get 100K executors.
>>> Moving on, and leaving Driver-side collect issues aside: you may not have enough memory for a given operation on an executor; Spark can spill to files on disk at the expense of processing speed. However, a partition should not exceed the maximum shuffle block size, which was beefed up some time ago. With multi-core executors, failures can still occur, i.e. OOMs, also as a result of GC issues, a difficult topic.
Will Spark automatically try to store the partitions on disk? (I'm not calling cache() or persist(); only transformations are happening, followed by an action.)
>>> Not if it can avoid it, but when memory is tight, eviction / spilling to disk can and will occur, and in some cases re-computation from source or from the last checkpoint will occur.
Since I have more partitions (30) than available cores (27), my cluster can process at most 27 partitions at a time. What will happen to the remaining 3 partitions? Will they wait for the occupied cores to be freed?
>>> They will be serviced by a free executor at some point in time.
If I call persist() with storage level MEMORY_AND_DISK and the partition size is greater than memory, will it spill data to disk? On which disk will this data be stored? The worker node's external HDD?
>>> Yes, and it will be spilled to the local file system. I think you can configure HDFS via a setting, but local disks are faster.
This is an insightful blog: https://medium.com/swlh/spark-oom-error-closeup-462c7a01709d
Your data partition size looks bigger than your per-core memory. Your per-core memory is ~1.6 GB (5 GB / 3 cores). This will be a problem, as a partition will not fit in a core's share of memory. To resolve this, you can try:
increasing the number of partitions such that each partition is smaller than the per-core memory of ~1.6 GB, so increase them to something like 150 partitions;
if you keep the partitions the same, increasing your executor memory and maybe also reducing the number of cores per executor.
If everything goes well, it will not need to store partitions on disk. However, if it is not able to find enough memory, it will fall back to disk. If you want to store your data on disk and persist it for some reason, you need to call persist(DISK_ONLY).
They will wait until one of the cores becomes available.
Yes, it will spill to disk. Where exactly will depend on your cluster configuration, I believe.
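A rough PySpark sketch of the two suggestions above (the paths are placeholders; the 150 partitions and the ~1.6 GB-per-core figure come from this answer):
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/path/to/180gb/input")    # placeholder path

# spread the ~180 GB (30 partitions * 6 GB) over enough partitions that each
# stays under the ~1.6 GB available per core (180 GB / 150 ~= 1.2 GB each)
df = df.repartition(150)

# MEMORY_AND_DISK keeps what fits in memory and spills the rest to the
# executors' local disks (the directories configured via spark.local.dir)
df.persist(StorageLevel.MEMORY_AND_DISK)
print(df.count())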

Spark shuffle partitions - what happens if I have fewer shuffle partitions than the number of cores?

I am using Databricks on Azure, so I don't have a way to specify the number of executors and the memory per executor.
Let's consider I have the following configuration.
10 Worker nodes, each with 4 cores and 10 GB of memory.
it's a standalone configuration
input read size is 100 GB
Now, if I set my shuffle partitions to 10 (less than the total cores, 40), what would happen?
Will it create a total of 10 executors, one per node, with each executor occupying all the cores and all the memory?
If you don't use dynamic allocation, you will end up leaving most cores unused during execution. Think of it as having 40 "slots" available for computation but only 10 tasks to process, so 30 "slots" will be empty (just idle).
I have to add that the above is a very simplified situation. In reality, you can have multiple stages running in parallel, so depending on your query, you may still have all 40 cores utilized (see e.g. Does stages in an application run parallel in spark?).
Note also that spark.sql.shuffle.partitions is not the only parameter that determines the number of tasks/partitions. You can have a different number of partitions for:
reading files
an explicit repartition in your query, e.g. when using:
df
  .repartition(100, $"key")
  .groupBy($"key")
  .count
Your value of spark.sql.shuffle.partitions=10 will be overridden by 100 in this exchange step.
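For completeness, a minimal PySpark sketch of the setting being discussed (the paths and the key column are hypothetical; 10 is the value from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# the question's setting: only 10 reduce-side tasks per shuffle
spark.conf.set("spark.sql.shuffle.partitions", "10")

df = spark.read.parquet("/path/to/100gb/input")    # placeholder path
# the shuffle stage of this aggregation runs with just 10 tasks, leaving
# 30 of the 40 cores idle during that stage
df.groupBy("key").count().write.mode("overwrite").parquet("/path/to/output")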
What you're describing as an expectation is called dynamic allocation in Spark. You can provide a min and max number of executors, and depending on the number of pending tasks/partitions the framework will scale up or down. https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
But with only 10 partitions for a 100 GB file, you will likely run into OutOfMemoryError.
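A minimal sketch of enabling dynamic allocation (the property names come from the linked configuration page; the min/max bounds are hypothetical, and on Databricks this is normally handled by cluster autoscaling instead):
from pyspark.sql import SparkSession

# dynamic allocation adds/removes executors based on the number of pending
# tasks; shuffle tracking (Spark 3.x) avoids needing an external shuffle service
spark = SparkSession.builder \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "1") \
    .config("spark.dynamicAllocation.maxExecutors", "10") \
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true") \
    .getOrCreate()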

Spark: HDFS blocks vs cluster cores vs RDD partitions

I have a question about Spark: HDFS blocks vs cluster cores vs RDD partitions.
Assume I am trying to process a file in HDFS (say the block size is 64 MB and the file is 6400 MB). So ideally it has 100 splits.
My cluster has 200 cores in total, and I submitted the job with 25 executors of 4 cores each (meaning 100 parallel tasks can run).
In a nutshell, I have 100 partitions by default in the RDD, and 100 tasks will run in parallel.
Is this a good approach, or should I repartition the data into 200 partitions and use all the cores in the cluster?
Since you have 200 cores in total, using all of them can improve the performance depending on what kind of workload you are running.
Configure your Spark application to use 50 executors (i.e. all 200 cores can be used by Spark). Also, change your Spark split size from 64 MB to 32 MB. This will make sure that the 6400 MB file is divided into 200 RDD partitions, so your entire cluster can be used for it.
Don't use repartition for this - it will be slow, as it involves a shuffle.
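One possible way to get the smaller split size, sketched for the DataFrame reader (spark.sql.files.maxPartitionBytes applies to file-based DataFrame sources; for the RDD API you can instead ask for a minimum number of partitions at read time; paths are placeholders):
from pyspark.sql import SparkSession

# target ~32 MB input splits so a 6400 MB file is read as ~200 partitions
spark = SparkSession.builder \
    .config("spark.sql.files.maxPartitionBytes", "32m") \
    .getOrCreate()

df = spark.read.text("hdfs:///path/to/6400mb_file")   # placeholder path
print(df.rdd.getNumPartitions())                       # expect roughly 200

# RDD API alternative: request at least 200 partitions at read time
rdd = spark.sparkContext.textFile("hdfs:///path/to/6400mb_file", minPartitions=200)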
