Are more partitions always better in Spark? - apache-spark

I am new to Spark and have a question:
are more partitions always better in Spark? If I have an OOM issue, do more partitions help?

Partitions determine the degree of parallelism.
The Apache Spark documentation says that the number of partitions should be at least equal to the number of cores in the cluster.
With very few partitions, not all the cores in the cluster are utilized.
If there are too many partitions and the data is small, then too many tiny tasks get scheduled and the scheduling overhead outweighs the useful work.
If you are getting out-of-memory errors, you would have to increase the executor memory; it should be a minimum of 8 GB.
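As a minimal sketch of the two knobs mentioned above (Scala): the "8g" memory setting, the 48-partition count, and the "data.csv" path are illustrative assumptions, not recommendations.

    // Sketch: raising executor memory and increasing the partition count.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partition-sizing-sketch")
      .config("spark.executor.memory", "8g")       // more heap per executor if tasks hit OOM
      .config("spark.default.parallelism", "48")   // default partition count for RDD shuffles (illustrative)
      .getOrCreate()

    val rdd = spark.sparkContext.textFile("data.csv")   // hypothetical input path
    println(s"Partitions from the input splits: ${rdd.getNumPartitions}")

    // More partitions mean each task processes a smaller slice of the data.
    val finer = rdd.repartition(48)
    println(s"After repartition: ${finer.getNumPartitions}")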

Related

Apache Spark: How many partitions can an executor hold in Spark? How are the partitions distributed (mechanism) among the executors?

I am interested in the following nitty-gritty details of Spark parallelism and partitioning:
1. How many partitions can an executor hold in Spark?
2. How are the partitions distributed (by what mechanism) among the executors?
3. How do I set the size of a partition? Which config parameter is relevant?
4. Does an executor store all of its partitions in memory? If not, when data is spilled to disk, does it spill an entire partition or only part of a partition?
5. When there are 2 cores per executor but 5 partitions in that executor, how are those partitions processed?
Not quite the right way to look at it. An Executor holds nothing; it just does work.
A Partition is processed by a Core that has been assigned to an Executor. An Executor typically has 1 Core but can be allocated more than one.
An App has Actions that translate to 1 or more Jobs.
A Job has Stages (based on Shuffle Boundaries).
Stages have Tasks, and the number of Tasks depends on the number of Partitions.
Parallel processing of the Partitions depends on the number of Cores allocated to Executors.
Spark is scalable in terms of Cores, Memory and Disk. In relation to your questions, the latter two mean that if the Partitions for your Job cannot all fit into Memory on the Worker, then one or more Partitions will spill to Disk in their entirety.
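A small sketch of that hierarchy in Scala; the numbers (2 cores per executor, 5 partitions) mirror point 5 of the question and are purely illustrative.

    // Sketch: partitions become tasks; cores bound how many run at once.
    // With spark.executor.cores = 2 and 5 partitions, at most 2 of the 5 tasks
    // run concurrently on that executor; the rest wait in the stage's task queue.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partition-task-sketch")
      .config("spark.executor.cores", "2")   // cores per executor (illustrative)
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 1000, numSlices = 5)   // 5 partitions -> 5 tasks per stage
    println(s"Partitions: ${rdd.getNumPartitions}")

    // The action triggers a job; its single stage launches one task per partition.
    val sum = rdd.map(_ * 2).reduce(_ + _)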

Spark behavior on native file system

We are experimenting with running Spark in our project without Hadoop and without distributed storage such as HDFS. Spark is installed on a single node with 10 cores and 16 GB RAM, and this node is not part of any cluster. Assume the Spark driver takes 2 cores and the rest are consumed by executors (2 cores each) at execution time.
If we process a big CSV file (1 GB in size) stored on local disk in Spark as an RDD and repartition it into 4 different partitions, will the executors process each partition in parallel?
What would the executors do if we don't repartition the RDD into 4 different partitions?
Do we lose the power of distributed computing and parallelism if we don't use HDFS?
Spark caps the maximum size of a partition at 2 GB, so you should be able to process the entire dataset with minimal partitioning and a quicker processing time. You can set spark.executor.cores to 8 to utilize all your resources.
Ideally, you should set the number of partitions based on the size of your data, and you are better off making the number of partitions a multiple of the number of cores/executors.
To answer your question, setting the number of partitions to 4 in your case will probably result in each partition being sent to an executor. So yes, each partition will be processed in parallel.
If you don't repartition, then Spark will do it for you depending on the data and split the load between the executors.
Spark works perfectly fine without Hadoop. You might see a negligible performance drop since your files are on the local filesystem and not on HDFS, but for a file of size 1GB it really doesn't matter.
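A sketch of the single-node setup described above (Scala), assuming a local[8] master and a hypothetical file path; the behavior of the Hadoop input splits is the default, not something this snippet configures.

    // Sketch: one node, 8 local worker threads, a local CSV, 4 partitions.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[8]")                  // 8 threads stand in for executor cores locally
      .appName("local-csv-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.textFile("/data/big.csv")   // hypothetical path; Hadoop input splits decide the initial count
    println(s"Partitions from input splits: ${lines.getNumPartitions}")

    // Repartition to 4, so up to 4 tasks run in parallel; with 8 threads available,
    // 4 of them would sit idle during this stage.
    val four = lines.repartition(4)
    val totalChars = four.map(_.length.toLong).reduce(_ + _)   // the action actually runs the tasks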

Is my understanding of spark partitioning correct?

I'd like to know if my understanding of partitioning in Spark is correct.
I always thought about the number of partitions and their size and never about the worker they were processed by.
Yesterday, as I was playing a bit with partitioning, I found out that I was able to track the cached partitions' locations using the Web UI (Storage -> Cached RDD -> Data Distribution), and it surprised me.
I have a cluster of 30 cores (3 cores * 10 executors) and I had an RDD of about 10 partitions. I tried to expand it to 100 partitions to increase the parallelism, only to find out that almost 90% of the partitions were on the same worker node; thus my parallelism was not limited by the total number of CPUs in my cluster but by the number of CPUs on the node holding 90% of the partitions.
I tried to find answers on Stack Overflow, and the only answer I could come across was about data locality: Spark detected that most of my files were on this node, so it decided to keep most of the partitions on this node.
Is my understanding correct?
And if it is, is there a way to tell Spark to really shuffle the data?
So far this "data locality" has led to heavy underutilization of my cluster.
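For reference, a sketch of the experiment described above (not an answer to the locality question), assuming a spark-shell session where sc is predefined and a hypothetical input path; one relevant fact is that repartition() always performs a full shuffle.

    // Sketch: cache a repartitioned RDD and inspect where its blocks land.
    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("/data/input")        // ~10 partitions, driven by the input splits
    val expanded = rdd.repartition(100)         // repartition() always performs a full shuffle
    expanded.persist(StorageLevel.MEMORY_ONLY)
    expanded.count()                            // materialize, then check Storage -> Data Distribution in the Web UI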

Spark performance tuning - number of executors vs number for cores

I have two questions around performance tuning in Spark:
I understand that one of the key things for controlling parallelism in a Spark job is the number of partitions in the RDD being processed, and then the number of executors and cores processing those partitions. Can I assume this to be true:
# of executors * # of executor cores should be <= # of partitions, i.e., one partition is always processed on one core of one executor, so there is no point having more executors * cores than the number of partitions.
I understand that having a high number of cores per executor can have a negative impact on things like HDFS writes, but here's my second question, purely from a data-processing point of view: what is the difference between the two? For example, if I have a 10-node cluster, what would be the difference between these two jobs (assuming there's ample memory per node to process everything):
5 executors * 2 executor cores
2 executors * 5 executor cores
Assuming there's infinite memory and CPU, from a performance point of view should we expect the above two to perform the same?
Most of the time, using larger executors (more memory, more cores) is better. First, a larger executor with more memory can easily support broadcast joins and do away with shuffles. Second, since tasks are not created equal, statistically larger executors have a better chance of surviving OOM issues.
The only problem with large executors is GC pauses. G1GC helps.
In my experience, if I had a cluster with 10 nodes, I would go for 20 Spark executors. The details of the job matter a lot, so some testing will help determine the optimal configuration.
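As a sketch, the two layouts from the question expressed as standard spark.executor.* settings (Scala, using SparkConf); the memory values and the G1GC flag are illustrative assumptions, not tuned recommendations.

    // Both layouts schedule at most 10 tasks concurrently (5*2 = 2*5 = 10);
    // they differ in how tasks share an executor JVM's heap.
    import org.apache.spark.SparkConf

    // Layout A: 5 executors x 2 cores
    val layoutA = new SparkConf()
      .set("spark.executor.instances", "5")
      .set("spark.executor.cores", "2")
      .set("spark.executor.memory", "8g")

    // Layout B: 2 executors x 5 cores -> a larger heap per JVM, which favours
    // broadcast joins but lengthens GC pauses; G1GC mitigates that.
    val layoutB = new SparkConf()
      .set("spark.executor.instances", "2")
      .set("spark.executor.cores", "5")
      .set("spark.executor.memory", "20g")
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")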

Max limit on number of RDDs in Spark

I am new to Spark and am currently learning related concepts. One basic question I have: what is the limit on the number of RDDs in Spark?
To my knowledge, there is no limit on the number of RDDs in Spark.
The only limitation is the memory and disk/swap space that Spark can use.
I have been running Spark over 40 TB of data in production and have yet to hit any kind of limit.
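A tiny sketch of why that is, assuming a spark-shell session where sc is predefined: an RDD is lazy lineage metadata, so defining many of them costs almost nothing until an action materializes (or caches) one.

    // 10,000 RDD handles are just lineage metadata; no cluster resources are used yet.
    val rdds = (1 to 10000).map(i => sc.parallelize(Seq(i)))
    println(s"Defined ${rdds.length} RDDs")

    // Only when an action runs (or an RDD is cached) do memory and disk come into play.
    println(rdds.head.count())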
