Spark shuffle partitions - what happens if I have fewer shuffle partitions than the number of cores?

I am using Databricks on Azure, so I don't have a way to set the number of executors and the memory per executor.
Let's consider I have the following configuration:
10 worker nodes, each with 4 cores and 10 GB of memory.
It's a standalone configuration.
The input read size is 100 GB.
Now, if I set my shuffle partitions to 10 (less than the total cores, 40), what would happen?
Will it create a total of 10 executors, one per node, with each executor occupying all the cores and all the memory?

If you don't use dynamic allocation, you will end up leaving most cores unused during execution. Think of it as having 40 "slots" available for computation but only 10 tasks to process, so 30 "slots" will be empty (just idle).
I have to add that the above is a very simplified picture. In reality, you can have multiple stages running in parallel, so depending on your query you may still have all 40 cores utilized (see e.g. Does stages in an application run parallel in spark?).
Note also that spark.sql.shuffle.partitions is not the only parameter that determines the number of tasks/partitions. You can have a different number of partitions for:
reading files
a query you modify using repartition, e.g. when using:
df
  .repartition(100, $"key")
  .groupBy($"key").count
your value of spark.sql.shuffle.partitions=10 will be overridden by 100 in this exchange step.
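To make that concrete, here is a minimal sketch; the local master, the tiny example DataFrame and switching AQE off are assumptions purely to keep the partition counts predictable:
import org.apache.spark.sql.SparkSession

// Minimal sketch: local master and the tiny Seq are placeholders just to show the behaviour.
val spark = SparkSession.builder()
  .appName("shuffle-partitions-sketch")
  .master("local[4]")
  .config("spark.sql.shuffle.partitions", "10")   // the setting discussed above
  .config("spark.sql.adaptive.enabled", "false")  // disable AQE so the partition counts stay predictable
  .getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

// Plain aggregation: the exchange uses spark.sql.shuffle.partitions = 10
println(df.groupBy($"key").count().rdd.getNumPartitions)                            // 10

// Explicit repartition by the grouping key: this exchange uses 100 partitions instead
println(df.repartition(100, $"key").groupBy($"key").count().rdd.getNumPartitions)   // 100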

What you're describing as an expectation is called dynamic allocation in Spark. You can provide a minimum and maximum number of executors, and the framework will scale depending on the number of partitions: https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
But with only 10 partitions on a 100 GB input you will likely hit OutOfMemoryErrors.
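For reference, dynamic allocation is driven entirely by configuration. A minimal sketch, assuming the settings are applied when building the session; the min/max values are placeholders to tune, not recommendations:
import org.apache.spark.sql.SparkSession

// Sketch only: the min/max values below are placeholders, not recommendations.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  // Without an external shuffle service, shuffle tracking (Spark 3.0+) is needed
  // so that executors holding shuffle data are not removed prematurely.
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .getOrCreate()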

Related

In a Spark cluster, do the reads and writes of files depend on the number of executors or the number of cores?

Let's say I have a 128 GB dataset read as a Spark DataFrame. I set the config as:
executor cores = 4
number of partitions = 1000
maxPartitionBytes = 128 MB
Going by the above information, the number of executors is 250.
How many files can be read/written in parallel in this cluster? Is it 250 or 1000?
I know that 1000 files will be written if there are 1000 partitions, but are the 1000 files written in parallel at the same time, or are they written as 250 files, four times consecutively?
Does the read/write depend on the number of executors or on the number of cores?
In Spark, the number of partitions defines the level of concurrency that can be achieved. This means that if you have 1000 partitions, as in your case, you can process all 1000 partitions in parallel provided you have 1000 executor cores.
Spark runs one task per partition, and each task is handled by an executor core, so each executor processes 4 partitions at a time (its 4 cores) and the cluster-wide parallelism is 250 × 4 = 1000.
Note: the number of tasks an executor runs in parallel equals the number of cores assigned to it; one core per node is usually reserved for system processes.
You can check out this article
https://www.sparkcodehub.com/spark-partitioning-shuffle
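As a back-of-the-envelope sketch of that arithmetic, using the numbers from the question (plain Scala, no Spark API):
// Rough arithmetic for the scenario in the question.
val numExecutors  = 250    // as derived in the question
val coresPerExec  = 4
val numPartitions = 1000

val taskSlots = numExecutors * coresPerExec                          // 1000 tasks can run at once
val waves     = math.ceil(numPartitions.toDouble / taskSlots).toInt  // 1 wave of tasks

println(s"tasks in parallel: $taskSlots, waves needed: $waves")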

Spark partition size greater than the executor memory

I have four questions. Suppose in Spark I have 3 worker nodes. Each worker node has 3 executors, each executor has 3 cores, and each executor has 5 GB of memory (total: 9 executors, 27 cores and 45 GB of memory). What will happen if:
I have 30 data partitions, each of size 6 GB. Optimally, the number of partitions should be equal to the number of cores, since each core executes one partition/task (one task per partition). Now, in this case, how will each executor core process its partition, since the partition size is greater than the available executor memory? Note: I'm not calling cache() or persist(); I'm simply applying some narrow transformations like map() and filter() on my RDD.
Will Spark automatically try to store the partitions on disk? (I'm not calling cache() or persist(); only transformations are happening, followed by an action.)
Since I have more partitions (30) than available cores (27), my cluster can process at most 27 partitions at a time. What will happen to the remaining 3 partitions? Will they wait for the occupied cores to be freed?
If I'm calling persist() with the storage level set to MEMORY_AND_DISK, and the partition size is greater than memory, will it spill data to disk? On which disk will this data be stored? The worker node's external HDD?
I'll answer each part as I understand it, possibly disregarding a few of your assertions:
I have four questions. Suppose in Spark I have 3 worker nodes. Each worker node has 3 executors, each executor has 3 cores, and each executor has 5 GB of memory (total: 9 executors, 27 cores and 45 GB of memory). What will happen if:
>>> I would use 1 Executor, 1 Core. That is the generally accepted paradigm afaik.
I have 30 data partitions, each of size 6 GB. Optimally, the number of partitions should be equal to the number of cores, since each core executes one partition/task (one task per partition). Now, in this case, how will each executor core process its partition, since the partition size is greater than the available executor memory? Note: I'm not calling cache() or persist(); I'm simply applying some narrow transformations like map() and filter() on my RDD.
>>> That the number of partitions must equal the number of cores is not true. You can service 1000 partitions with 10 cores, each core processing one partition at a time. What if you have 100K partitions and are on-prem? It is unlikely you will get 100K executors.
>>> Moving on, and leaving driver-side collect issues to one side: you may not have enough memory for a given operation on an executor; Spark can spill to files on disk at the expense of processing speed. However, a partition should not exceed the maximum partition size (a limit that was raised some time ago). With multi-core executors, failures can still occur, i.e. OOMs, also as a result of GC issues, which is a difficult topic.
Will Spark automatically try to store the partitions on disk? (I'm not calling cache() or persist(); only transformations are happening, followed by an action.) >>> Not if it can avoid it, but when memory is tight, eviction / spilling to disk can and will occur, and in some cases re-computation from the source or the last checkpoint will occur.
Since I have more partitions (30) than available cores (27), my cluster can process at most 27 partitions at a time; what will happen to the remaining 3 partitions? Will they wait for the occupied cores to be freed? >>> They will be serviced by a free executor at some point in time.
If I'm calling persist() with the storage level set to MEMORY_AND_DISK, and the partition size is greater than memory, will it spill data to disk? On which disk will this data be stored? The worker node's external HDD? >>> Yes, and it will be spilled to the local file system. I think you can configure it to use HDFS via a setting, but local disks are faster (see the sketch after this answer).
This is an insightful blog: https://medium.com/swlh/spark-oom-error-closeup-462c7a01709d
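To illustrate the persist(MEMORY_AND_DISK) point from question 4, here is a minimal sketch; the S3 path and the spark.local.dir value are placeholders, and cluster managers may override the local directory setting:
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch of the MEMORY_AND_DISK case: blocks that do not fit in memory are written
// to the executor's local directories (spark.local.dir), not to HDFS/S3.
val spark = SparkSession.builder()
  .appName("persist-spill-sketch")
  .config("spark.local.dir", "/mnt/local-scratch")  // placeholder; may be overridden by the cluster manager
  .getOrCreate()

val rdd = spark.sparkContext
  .textFile("s3://some-bucket/big-input/")          // placeholder path
  .map(_.toUpperCase)

// Partitions that do not fit in executor memory are spilled to local disk and
// re-read from there on later actions.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
println(rdd.count())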
Your data partition size looks bigger than your per-core memory. Your per-core memory is ~1.6 GB (5 GB / 3 cores). This will be a problem, as your partitions will not fit in the memory available to a core. To resolve this, you can try:
Increasing the number of partitions such that each partition is smaller than the per-core memory of ~1.6 GB, so increase them to something like 150 partitions (see the sketch after this answer).
If you keep the partitions the same, you should try increasing your executor memory and maybe also reducing the number of cores in your executors.
If everything goes well, it will not need to store partitions on disk. However, if it is not able to find enough memory, it will fall back to disk. If you want to store your data on disk and persist it for some reason, you need to call persist(DISK_ONLY).
They will wait until one of the cores is available.
Yes, it will spill to disk. Where exactly will depend on your cluster configuration, I believe.
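A minimal sketch of the first suggestion, assuming the numbers above (the S3 path is a placeholder): ~180 GB of input (30 partitions x 6 GB) spread over 150 partitions gives roughly 1.2 GB per partition, under the ~1.6 GB available per core.
import org.apache.spark.sql.SparkSession

// Sketch: repartition so each partition fits comfortably in per-core memory.
val spark = SparkSession.builder().appName("resize-partitions-sketch").getOrCreate()

val resized = spark.sparkContext
  .textFile("s3://some-bucket/input/")   // placeholder path
  .repartition(150)                      // ~180 GB / 150 partitions ≈ 1.2 GB each

// Narrow transformations now operate on ~1.2 GB partitions, one per task.
resized.map(_.trim).filter(_.nonEmpty).count()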

Spark performance tuning - number of executors vs number of cores

I have two questions around performance tuning in Spark:
I understand that one of the key things for controlling parallelism in a Spark job is the number of partitions in the RDD being processed, and then controlling the executors and cores processing these partitions. Can I assume this to be true:
# of executors * # of executor cores should be <= # of partitions, i.e. one partition is always processed by one core of one executor, and there is no point having more executors*cores than the number of partitions.
I understand that having a high number of cores per executor can have a negative impact on things like HDFS writes, but here's my second question: purely from a data processing point of view, what is the difference between the two? For example, if I have a 10-node cluster, what would be the difference between these two jobs (assuming there's ample memory per node to process everything):
5 executors * 2 executor cores
2 executors * 5 executor cores
Assuming there's infinite memory and CPU, from a performance point of view should we expect the above two to perform the same?
Most of the time, using larger executors (more memory, more cores) is better. First: a larger executor with more memory can easily support broadcast joins and do away with the shuffle. Second: since tasks are not created equal, statistically larger executors have a better chance of surviving OOM issues.
The only problem with large executors is GC pauses. G1GC helps.
In my experience, if I had a cluster with 10 nodes, I would go for 20 Spark executors. The details of the job matter a lot, so some testing will help determine the optimal configuration.
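To make the comparison concrete, the two layouts from the question differ only in executor settings. A minimal sketch, expressed through the SparkSession builder only to keep the examples in one language; normally you would pass these with spark-submit, and the values are the question's, not recommendations:
import org.apache.spark.sql.SparkSession

// Sketch: spark.executor.instances is honoured on YARN/standalone; you would
// submit one layout per job rather than create two sessions in one JVM.
def buildSession(instances: Int, coresPerExecutor: Int): SparkSession =
  SparkSession.builder()
    .appName(s"layout-${instances}x$coresPerExecutor")
    .config("spark.executor.instances", instances.toString)
    .config("spark.executor.cores", coresPerExecutor.toString)
    .getOrCreate()

// Layout A: 5 executors x 2 cores = 10 task slots, smaller JVMs, shorter GC pauses.
val sparkA = buildSession(instances = 5, coresPerExecutor = 2)
// Layout B: 2 executors x 5 cores = 10 task slots, larger JVMs, easier broadcast joins.
// val sparkB = buildSession(instances = 2, coresPerExecutor = 5)  // run as a separate job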

Spark repartition does not divide data across all executors

I have 8 executors with 4 cores each, and I repartition an RDD to 32 partitions. I expect all 8 executors to play a part in the next action that I call on the repartitioned data, but it seems like sometimes 3 executors participate and sometimes 4, but not more than that.
How can I ensure that the data gets divided across all executors?
rdd.repartition(32).foreachPartition { part =>
  updateMem(part)
}
The last part calls insert/update into MemSQL.
The below answer is valid only if you are using AWS EMR.
I don't think it is correct to say that you have 8 executors of 4 cores each. Here is the explanation. Say I am using an m3.2xlarge machine (EMR):
Each machine contains 30 GB of memory (total) and 8 vCores.
There is no way you can use all 30 GB of memory for executors, as the machine needs some memory for its own use.
You would want to leave enough memory for the machine's own use (like the OS and other usage) so that there will not be any system failure.
Say you want to leave 10 GB of memory for the machine; then you are left with 20 GB.
In 20 GB of memory you can have 6 executors (3 GB each, 6 * 3 GB = 18 GB), or 4 executors (5 GB each, 4 * 5 GB = 20 GB), etc.
So you can decide the number of executors depending on how much memory you want for each executor.
To be specific to your use case, look at the total memory available on each machine and check the Spark conf (/etc/spark/conf/spark-defaults.conf) for these two parameters, then adjust accordingly:
spark.executor.memory
spark.executor.cores
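Separately from the memory sizing above, a hedged way to check whether the 32 partitions actually spread across all executors is to tag each partition with the host that processed it (the parallelize() input is a stand-in for the real RDD; updateMem and the MemSQL write are omitted):
import java.net.InetAddress
import org.apache.spark.sql.SparkSession

// Diagnostic sketch: record which host handled each of the 32 partitions,
// so you can see how many executors really take part.
val spark = SparkSession.builder().appName("partition-spread-check").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 1000000)   // stand-in for the questioner's RDD

val hostsPerPartition = rdd
  .repartition(32)
  .mapPartitionsWithIndex { (idx, part) =>
    val host = InetAddress.getLocalHost.getHostName
    Iterator((idx, host, part.size))   // consuming the iterator forces the work on this host
  }
  .collect()

hostsPerPartition
  .groupBy { case (_, host, _) => host }
  .foreach { case (host, parts) => println(s"$host handled ${parts.length} partitions") }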

Partitioning the RDD for Spark Jobs

When I submit a Spark job to a YARN cluster, I see in the Spark UI that I get 4 stages, but the memory used is very low on all nodes: it says 0 out of 4 GB used. I guess that might be because I left it at the default partitioning.
File sizes range between 1 MB and 100 MB in S3. There are around 2700 files with a total size of 26 GB, and exactly 2700 tasks were running in stage 2.
Is it worth repartitioning to something around 640 partitions? Would it improve the performance? Or
does it not matter if the partitioning is more granular than actually required? Or
do my submit parameters need to be addressed?
Cluster details:
Cluster with 10 nodes
Overall memory 500 GB
Overall vCores 64
--executor-memory 16g
--num-executors 16
--executor-cores 1
Actually it runs on 17 cores out of 64. I don't want to increase the number of cores since others might use the cluster.
You partition, and repartition, for the following reasons:
To make sure we have enough work to distribute to the distinct cores in our cluster (nodes * cores_per_node). Obviously we need to tune the number of executors, cores per executor, and memory per executor to make that happen as intended.
To make sure we evenly distribute work: the smaller the partitions, the smaller the chance that one core has much more work to do than all the other cores. Skewed distribution can have a huge effect on total elapsed time if the partitions are too big.
To keep partitions at manageable sizes: not too big, and not too small, so we don't overtax the GC. Also, bigger partitions can cause issues when operations have non-linear complexity.
Too-small partitions will create too much processing overhead.
As you might have noticed, there will be a goldilocks zone. Testing will help you determine ideal partition size.
Note that it is OK to have many more partitions than cores. Queuing partitions to be assigned a task is something that I design for.
Also make sure you configure your Spark job properly, otherwise:
Make sure you do not have too many executors. One or very few executors per node is more than enough. Fewer executors have less overhead, as tasks within an executor work in a shared memory space and are handled by threads instead of processes. There is a huge amount of overhead in starting up a process, but threads are pretty lightweight.
Tasks need to talk to each other. If they are in the same executor, they can do that in-memory. If they are in different executors (processes), then that happens over a socket (overhead). If that is over multiple nodes, that happens over a traditional network connection (more overhead).
Assign enough memory to your executors. When using YARN as the scheduler, it will place executors by default based on the memory they request, not on the CPU you declare.
I do not know what your situation is (you made the node names invisible), but if you only have a single node with 15 cores, then 16 executors do not make sense. Instead, set it up with one executor and 16 cores per executor.
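As a hedged sketch of the small-files situation in the question (the S3 paths are placeholders, and whether ~64 or ~640 partitions works best is exactly the kind of thing the testing mentioned above should decide):
import org.apache.spark.sql.SparkSession

// Sketch for the ~2700 small S3 files: the question observed roughly one task per
// file, so bring the partition count down before the heavy stages.
val spark = SparkSession.builder().appName("small-files-sketch").getOrCreate()

val raw = spark.read.textFile("s3://some-bucket/input/")     // placeholder path

// coalesce avoids a full shuffle; use repartition instead if the data is skewed.
val compacted = raw.coalesce(64)

println(compacted.rdd.getNumPartitions)
compacted.write.parquet("s3://some-bucket/output/")          // placeholder path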
