Spark: HDFS blocks vs. cluster cores vs. RDD partitions

I have a question about Spark: HDFS blocks vs. cluster cores vs. RDD partitions.
Assume I am trying to process a file stored in HDFS (say the block size is 64 MB and the file is 6400 MB), so ideally it has 100 splits.
My cluster has 200 cores in total, and I submitted the job with 25 executors of 4 cores each (meaning 100 tasks can run in parallel).
In a nutshell, the RDD has 100 partitions by default and 100 cores will be used.
Is this a good approach, or should I repartition the data into 200 partitions and use every core in the cluster?

Since you have 200 cores in total, using all of them can improve performance, depending on the kind of workload you are running.
Configure your Spark application to use 50 executors (i.e. all 200 cores are available to Spark), and change the Spark split size from 64 MB to 32 MB. This will ensure that the 6400 MB file is divided into 200 RDD partitions, so your entire cluster can work on it.
Don't use repartition here: it will be slow because it involves a shuffle.
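A minimal sketch of what that could look like; the input path and the choice of mechanism for shrinking the split size (minPartitions for the RDD API, spark.sql.files.maxPartitionBytes for DataFrame file sources) are illustrative assumptions rather than anything the answer prescribes:

import org.apache.spark.sql.SparkSession

// Submitted e.g. with: spark-submit --num-executors 50 --executor-cores 4 ...
val spark = SparkSession.builder()
  .appName("split-size-sketch")
  // For DataFrame file sources, cap each input partition at roughly 32 MB.
  .config("spark.sql.files.maxPartitionBytes", "32MB")
  .getOrCreate()

// For the RDD API, ask for at least 200 input partitions when reading the
// (hypothetical) 6400 MB file instead of the ~100 block-sized splits.
val rdd = spark.sparkContext.textFile("hdfs:///data/input_6400mb", minPartitions = 200)
println(s"input partitions: ${rdd.getNumPartitions}")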

Related

Spark shuffle partitions: what happens if I have fewer shuffle partitions than cores?

I am using Databricks on Azure, so I don't have a way to set the number of executors and the memory per executor.
Let's say I have the following configuration:
10 worker nodes, each with 4 cores and 10 GB of memory.
It's a standalone configuration.
The input read size is 100 GB.
Now, if I set my shuffle partitions to 10 (less than the total of 40 cores), what would happen?
Will it create a total of 10 executors, one per node, with each executor occupying all the cores and all the memory?
If you don't use dynamic allocation, you will end up leaving most cores unused during execution. Think of it as having 40 "slots" available for computation but only 10 tasks to process, so 30 "slots" stay empty (simply idle).
I have to add that the above is a very simplified picture. In reality, you can have multiple stages running in parallel, so depending on your query you may still have all 40 cores utilized (see e.g. Does stages in an application run parallel in spark?).
Note also that spark.sql.shuffle.partitions is not the only parameter that determines the number of tasks/partitions. You can have a different number of partitions for:
- reading files
- queries that modify partitioning with repartition, e.g. when using:
df
  .repartition(100, $"key")
  .groupBy($"key")
  .count
your value of spark.sql.shuffle.partitions=10 will be overridden by 100 in this exchange step.
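A small runnable sketch of that behavior (assuming a local session; adaptive query execution is disabled here so the raw partition counts stay visible, otherwise AQE may coalesce them):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-partitions-sketch")
  .master("local[4]")
  .config("spark.sql.adaptive.enabled", "false")
  .getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.shuffle.partitions", "10")

val df = (1 to 100000).map(i => (i % 50, i)).toDF("key", "value")

// Plain aggregation: the exchange uses spark.sql.shuffle.partitions = 10.
println(df.groupBy($"key").count().rdd.getNumPartitions) // 10

// An explicit repartition(100, $"key") overrides that value for this exchange.
val counted = df.repartition(100, $"key").groupBy($"key").count()
println(counted.rdd.getNumPartitions) // 100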
What you are describing as an expectation is called dynamic allocation in Spark. You can provide a minimum and maximum allocation, and the framework scales the number of executors depending on the number of partitions: https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
But with only 10 partitions for a 100 GB file you will run into OutOfMemoryErrors.
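For reference, a sketch of how dynamic allocation might be configured when building the session; the min/max values are placeholders, and on a standalone cluster the external shuffle service (or, on Spark 3.x, shuffle tracking) also has to be enabled:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")  // placeholder
  .config("spark.dynamicAllocation.maxExecutors", "10") // placeholder
  .config("spark.shuffle.service.enabled", "true")      // needed unless shuffle tracking is used
  .getOrCreate()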

Can reduced parallelism lead to no shuffle spill?

Consider an example:
I have a cluster with 5 nodes and each node has 64 cores with 244 GB memory.
I decide to run 3 executors on each node, setting executor-cores to 21 and executor memory to 80 GB, so that each executor can execute 21 tasks in parallel. Now consider 315 (63 * 5) partitions of data, of which 314 are 3 GB in size but one is 30 GB (due to data skew).
All of the executors that received only 3 GB partitions have 63 GB occupied (21 * 3, since each executor can run 21 tasks in parallel and each task takes 3 GB of memory).
But the one executor that received the 30 GB partition will need 90 GB (20 * 3 + 30) of memory. So will this executor first execute the 20 tasks of 3 GB and then load the 30 GB task, or will it try to load all 21 tasks and find that for one task it has to spill to disk? If I set executor-cores to just 15, the executor that receives the 30 GB partition will only need 14 * 3 + 30 = 72 GB and hence won't spill to disk.
So in this case, will reduced parallelism lead to no shuffle spill?
Here are a few pointers.
Spark (shuffle) map stage ==> the size of each partition depends on the filesystem's block size. E.g. if data is read from HDFS, each partition will try to hold close to 128 MB of data, so for the input data: number of partitions = floor(number of files * block size / 128 MB) (actually 122.07, since a mebibyte is used).
Now the scenario you are describing concerns shuffled data on the reducer side (the result stage).
Here the blocks processed by the reducer tasks are called shuffle blocks, and by default Spark (for the SQL/Core APIs) will launch 200 reducer tasks.
Now an important thing to remember: a shuffle block can hold at most 2 GB, so if you have too few partitions and one of them does a remote fetch of a shuffle block > 2 GB, you will see an error like Size exceeds Integer.MAX_VALUE.
To mitigate that, within the default limit Spark employs many optimizations (compression, tungsten-sort shuffle, etc.), but as developers we can repartition skewed data intelligently and tune the default parallelism.
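One common way to "repartition skewed data intelligently" is key salting: splitting the hot key across many reducer tasks and aggregating in two steps. Below is a sketch under the assumption of a single hot key in a "key" column; the column names, salt factor, and partition count are made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("skew-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Raise reducer-side parallelism so no single shuffle block approaches the 2 GB limit.
spark.conf.set("spark.sql.shuffle.partitions", "630")

// Toy skewed input: key 0 occurs far more often than the others.
val df = spark.range(0, 1000000)
  .withColumn("key", when($"id" % 10 === 0, lit(0)).otherwise($"id" % 100))

// Step 1: add a random salt and aggregate per (key, salt), so the hot key's
// rows are spread over up to 20 reducer partitions instead of one.
val partial = df.withColumn("salt", (rand() * 20).cast("int"))
  .groupBy($"key", $"salt")
  .count()

// Step 2: combine the partial counts per key.
val result = partial.groupBy($"key").agg(sum("count").as("count"))
result.show(5)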

Spark behavior on native file system

We are experimenting with running Spark in our project without Hadoop and without distributed storage like HDFS. Spark is installed on a single node with 10 cores and 16 GB of RAM, and this node is not part of any cluster. Assume the Spark driver takes 2 cores and the rest are consumed by executors (2 cores each) at execution time.
If we process a big CSV file (1 GB in size) stored on the local disk in Spark as an RDD and repartition it into 4 partitions, will the executors process each partition in parallel?
What would the executors do if we don't repartition the RDD into 4 partitions?
Do we lose the power of distributed computing and parallelism if we don't use HDFS?
Spark caps the maximum size of a partition at 2 GB, so you should be able to process the entire dataset with minimal partitioning and a quick processing time. You can set spark.executor.cores to 8 to utilize all your resources.
Ideally, you should set the number of partitions depending on the size of your data, and you are better off setting the number of partitions as a multiple of your cores/executors.
To answer your question, setting the number of partitions to 4 in your case will probably result in each partition being sent to an executor, so yes, each partition will be processed in parallel.
If you don't repartition, then Spark will do it for you depending on the data and will split the load between the executors.
Spark works perfectly fine without Hadoop. You might see a negligible performance drop since your files are on the local filesystem rather than on HDFS, but for a 1 GB file it really doesn't matter.
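A sketch of the single-node scenario above; the local path and the local[8] master are assumptions made purely for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("local-fs-sketch")
  .master("local[8]") // use 8 of the 10 cores, as suggested above
  .getOrCreate()

// Read a 1 GB CSV straight from the local filesystem, no HDFS involved.
val rdd = spark.sparkContext.textFile("file:///data/big.csv")
println(s"default partitions: ${rdd.getNumPartitions}")

// Explicitly repartition into 4; each partition becomes an independent task
// that the available cores can run in parallel.
val four = rdd.repartition(4)
println(s"after repartition: ${four.getNumPartitions}") // 4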

EMR Cluster utilization

I have a 20 node c4.4xlarge cluster to run a Spark job. Each node has 16 vCores, 30 GiB of memory, and EBS-only storage of 32 GiB.
Since each node has 16 vCores, I understand that the maximum number of executors is 16 * 20 = 320. The total memory available is 20 (nodes) * 30 GiB ~ 600 GB. Assigning one third to system operations, I have about 400 GB of memory to process my data in-memory. Is this the right understanding?
Also, the Spark History Server shows a non-uniform distribution of input and shuffle data. I believe the processing is not distributed evenly across executors. I pass these config parameters in my spark-submit:
--conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.minExecutors=20
The executor summary from the Spark history UI also shows that the data distribution is completely skewed, and I am not using the cluster in the best way. How can I distribute my load in a better way?
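A commonly tried first step for this kind of imbalance is an explicit round-robin repartition to a multiple of the cluster's total cores, together with an upper bound on dynamic allocation; a sketch with placeholder paths, formats, and numbers:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("emr-utilization-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "20")
  .config("spark.dynamicAllocation.maxExecutors", "60") // placeholder upper bound
  .getOrCreate()

// "s3://bucket/input" is a placeholder path.
val df = spark.read.parquet("s3://bucket/input")

// Spread the work over a multiple of the cluster's cores (20 nodes * 16 vCores = 320)
// so that no handful of executors receives most of the data.
val balanced = df.repartition(640)
println(balanced.rdd.getNumPartitions) // 640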

How Apache Spark partitions data of a big file [duplicate]

This question already has answers here: How does Spark partition(ing) work on files in HDFS?
Let's say I have a cluster of 4 nodes, each with 1 core. I have a big file of 600 petabytes that I want to process with Spark. The file could be stored in HDFS.
I think the way to determine the number of partitions is file size / total number of cores in the cluster. If that is indeed the case, I will have 4 partitions (600 / 4), so each partition will be 150 PB in size.
But I think 150 PB is too big a size for a partition, so is my reasoning about deducing the number of partitions correct?
PS: I have just started with Apache Spark, so apologies if this is a naive question.
As you are storing your data on HDFS, it is already partitioned into 64 MB or 128 MB blocks as per your HDFS configuration. (Let's assume 128 MB blocks.)
So 600 petabytes will result in 4687500000 blocks of 128 MB each (600 petabytes / 128 MB).
Now when you run your Spark job, each executor will read a few blocks of data (the number of blocks will be equal to the number of cores in the executor) and process them in parallel.
Basically, each core will process one partition. So the more cores you give to an executor, the more data it can process, but at the same time you will need to allocate more memory to the executor to handle the size of the data loaded into memory.
It is advised to use moderately sized executors. Having too many small executors will cause a lot of data shuffling.
Now coming to your scenario: if you have a 4-node cluster with 1 core each, you will have at most 3 executors running, as 1 core will be taken by the Spark driver.
So to process the data, you will be able to process 3 partitions in parallel.
So it will take your job 4687500000 / 3 = 1562500000 iterations to process the whole data.
Hope that helps!
Cheers!
To answer your question: if you have stored the file in HDFS, it is already partitioned based on your HDFS configuration, i.e. if the block size is 64 MB, your file will be divided into such blocks and spread across the Hadoop cluster. Spark will generate tasks according to your num-executors configuration to decide how many parallel tasks can be executed; expect no_of_hdfs_blocks = no_of_total_tasks.
Next, what matters is how you process this data: whether you do any shuffling, e.g. something similar to repartition(*), which will move the data around the cluster and change the number of partitions processed by your Spark job.
HTH!
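A quick way to check both points, i.e. that the initial partition count follows the HDFS blocks and that a shuffle such as repartition changes it downstream (the path is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hdfs-partitions-sketch").getOrCreate()
val sc = spark.sparkContext

// With a 128 MB block size, each HDFS block normally maps to one input partition.
val rdd = sc.textFile("hdfs:///data/bigfile")
println(s"input partitions: ${rdd.getNumPartitions}")

// A shuffle, e.g. repartition, changes the partition count for downstream stages.
println(rdd.repartition(1000).getNumPartitions) // 1000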
