Is my understanding of spark partitioning correct? - apache-spark

I'd like to know if my understanding of partitioning in Spark is correct.
I always thought about the number of partitions and their size and never about the worker they were processed by.
Yesterday, as I was playing a bit with partitioning, I found out that I was able to track the location of cached partitions using the web UI (Storage -> Cached RDD -> Data Distribution), and it surprised me.
I have a cluster of 30 cores (3 cores * 10 executors) and I had an RDD of about 10 partitions. I tried to expand it to 100 partitions to increase the parallelism, only to find out that almost 90% of the partitions were on the same worker node. My parallelism was therefore limited not by the total number of CPUs in my cluster but by the number of CPUs of the node containing 90% of the partitions.
I tried to find answers on Stack Overflow, and the only answer I could come by was about data locality: Spark detected that most of my files were on this node, so it decided to keep most of the partitions on this node.
Is my understanding correct?
And if it is, is there a way to tell Spark to really shuffle the data?
So far this "data locality" has led to heavy underutilization of my cluster.
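To make the effect described in the question concrete, here is a rough sketch in plain Python (no Spark needed). It assumes every task takes the same time and that each partition is processed only where it is stored; the partition layouts are hypothetical, and in practice Spark relaxes locality after `spark.locality.wait`, so this is a worst-case picture rather than a model of the scheduler.

```python
import math

def estimated_waves(partitions_per_executor, cores_per_executor):
    """Sequential task 'waves' needed when each partition is processed
    only on the executor that holds it (strict locality assumption)."""
    return max(
        math.ceil(parts / cores_per_executor)
        for parts in partitions_per_executor
    )

# Hypothetical layouts: 10 executors with 3 cores each, 100 partitions.
skewed = [90] + [2] * 5 + [0] * 4   # ~90% of the partitions on one executor
balanced = [10] * 10                # partitions spread evenly

print(estimated_waves(skewed, 3))    # 30
print(estimated_waves(balanced, 3))  # 4
```

Under these assumptions the skewed layout needs 30 waves where the balanced one needs 4, which matches the underutilization seen in the UI.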

Related

How do you efficiently bucket/partition on a shared cluster that autoscales?

Edit: Using Spark with Databricks
As far as I understand, effective partitioning should be based on the number of executors available, ideally with partitions % executors = 0.
But if you work on a shared Spark cluster that autoscales according to activity, and in which people may be keeping some executors busy with their own work, is it possible to efficiently partition and bucket in this way?
Say I notice there are 8 executors active on the cluster, so I make 8 partitions or buckets to distribute the workload more easily. While that's happening, Alice and Jane log on and start running big queries, so the cluster upscales to, say, 12 executors.
Now I'm no longer efficiently partitioned. Or what if the cluster doesn't upscale, but Alice and Jane take up some executors? Now my partitions will be skewed, right?
Or... will Spark recognise that I have 8 partitions, and upscale as needed to match that if enough aren't immediately available?
The rule partitions % executors = 0 is meant to keep processing efficient, so that you never have fewer partitions than executors at any point in time. In reality, things are more complicated: partitions could be small and then be automatically coalesced when Adaptive Query Execution (AQE) kicks in, combining multiple small partitions into bigger logical partitions, and so on. That is one of the "optimizations" on Spark 3.x: set shuffle partitions to some big number and let AQE optimize it, instead of ending up with partitions that are too big.
Yes, on a shared cluster some of the resources could be consumed by other users, but that will just allocate fewer cores to your processing; it will not skew your partitions. Skewed partitions are primarily about partitions of different sizes, and this should also be handled by AQE, which is enabled on DBR 7.3+.
Overall: yes, on shared clusters some resources will be taken by other users, but otherwise it's better to rely on the automatic optimization improvements in Spark 3.x. Previous versions needed a lot of manual tuning that isn't required in newer versions.
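As a rough illustration of the "big number plus AQE" approach, the sketch below collects the relevant Spark SQL properties and prints them as `key value` lines, the layout used by spark-defaults.conf. The property names are standard Spark 3.x settings; the shuffle partition count of 2000 and the helper names are only assumptions for the example, not a recommendation.

```python
def aqe_conf(shuffle_partitions):
    """Spark SQL settings that let AQE coalesce small shuffle partitions
    and split skewed ones, starting from a deliberately large count."""
    return {
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.sql.adaptive.skewJoin.enabled": "true",
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }

def as_defaults_lines(conf):
    """Render the settings as 'key value' lines, sorted by key."""
    return [f"{key} {value}" for key, value in sorted(conf.items())]

conf = aqe_conf(2000)  # hypothetical "big number"
for line in as_defaults_lines(conf):
    print(line)
```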

Decide the number of partitions in Spark (running on YARN) based on executors, cores and memory

How do I decide the number of partitions in Spark (running on YARN) based on executors, cores and memory?
As I am new to Spark, I don't have much hands-on experience with real scenarios.
I know there are many things to consider when deciding the number of partitions, but a detailed explanation of a general production scenario would still be very helpful.
Thanks in advance.
One important parameter for parallel collections is the number of
partitions to cut the dataset into. Spark will run one task for each
partition of the cluster. Typically you want 2-4 partitions for each
CPU in your cluster
The recommended number of partitions is 2 to 4 times the number of cores.
So if you have 7 executors with 5 cores each, you can repartition to between 7*5*2 = 70 and 7*5*4 = 140 partitions.
https://spark.apache.org/docs/latest/rdd-programming-guide.html
IMO with Spark 3.0 and AWS EMR 2.4.x with adaptive query execution you're often better off letting Spark handle it. If you do want to hand-tune it, the answer can often be complicated. One good option is to have 2 or 4 times as many partitions as CPUs available. While this is useful for most data sizes, it becomes problematic with very large and very small datasets. In those cases it's useful to aim for ~128MB per partition.
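The two rules of thumb from the answers above can be put side by side in a few lines of plain Python. This is just a sketch: the function names are made up for the example, and the 50 GiB dataset is a hypothetical input.

```python
import math

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # ~128MB per partition

def partitions_by_cores(executors, cores_per_executor, factor):
    """Rule of thumb: 'factor' (2 to 4) partitions per core."""
    return executors * cores_per_executor * factor

def partitions_by_size(dataset_bytes, target_bytes=TARGET_PARTITION_BYTES):
    """Alternative for very large or very small data: aim for a target size."""
    return max(1, math.ceil(dataset_bytes / target_bytes))

low = partitions_by_cores(7, 5, 2)    # 70
high = partitions_by_cores(7, 5, 4)   # 140
by_size = partitions_by_size(50 * 1024 ** 3)  # hypothetical 50 GiB dataset

print(low, high, by_size)  # 70 140 400
```

Here the size-based rule asks for more partitions than the core-based one, which is the "very large dataset" case mentioned above.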

What performance parameters to set for spark scala code to run on yarn using spark-submit?

My use case is to merge two tables, where one table contains 30 million records with 200 cols and the other contains 1 million records with 200 cols. I am using a broadcast join for the small table. I am loading both tables as DataFrames from Hive managed tables on HDFS.
I need the values to set for driver memory, executor memory and the other related parameters for this use case.
I have this hardware configuration for my YARN cluster:
Spark Version 2.0.0
Hdp version 2.5.3.0-37
1) yarn clients 20
2) Max. virtual cores allocated for a container (yarn.scheduler.maximum.allocation-vcores) 19
3) Max. Memory allocated for a yarn container 216gb
4) Cluster Memory Available 3.1 TB available
Any other info you need I can provide for this cluster.
I have to decrease the time to complete this process.
I have been using some configurations, but I think they are wrong. It took 4.5 minutes to complete, but I think Spark is capable of decreasing this time.
There are mainly two things to look at when you want to speed up your Spark application.
Caching/persistence:
This is not a direct way to speed up the processing. It is useful when you have multiple operations (reduce, join, etc.) that reuse the same RDDs and you want to avoid recomputing those RDDs, for example after a failure, and hence decrease the application's run duration.
Increasing the parallelism:
This is the actual solution to speed up your Spark application. It can be achieved by increasing the number of partitions. Depending on the use case, you might have to increase the partitions in one of two ways:
Whenever you create your DataFrames/RDDs: this is the better way to increase the partitions, as you don't have to trigger a costly shuffle operation to do it.
By calling repartition: this will trigger a shuffle operation.
Note: Once you increase the number of partitions, also increase the number of executors (possibly a very large number of small containers with few vcores and a few GB of memory each).
Increasing the parallelism inside each executor:
By adding more cores to each executor, you can increase the parallelism at the partition level. This will also speed up the processing.
For a better understanding of the configurations, please refer to this post.
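For reference, here is a small sketch of how these knobs map onto spark-submit flags on YARN. The flags themselves (`--num-executors`, `--executor-cores`, `--executor-memory`, `--driver-memory`, `--conf`) are standard, but every value below, as well as the class name and the jar path, is a hypothetical placeholder and not a tuned recommendation for the cluster in the question.

```python
def spark_submit_args(num_executors, executor_cores, executor_memory_gb,
                      driver_memory_gb, shuffle_partitions, main_class, jar_path):
    """Assemble a spark-submit command line for a YARN cluster."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", f"{executor_memory_gb}g",
        "--driver-memory", f"{driver_memory_gb}g",
        "--conf", f"spark.sql.shuffle.partitions={shuffle_partitions}",
        "--class", main_class,
        jar_path,
    ]

# Hypothetical placeholders: many small containers, 2 partitions per core.
args = spark_submit_args(
    num_executors=40, executor_cores=4, executor_memory_gb=8,
    driver_memory_gb=4, shuffle_partitions=40 * 4 * 2,
    main_class="com.example.MergeTables", jar_path="merge-tables.jar",
)
print(" ".join(args))
```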

Apache Spark running out of memory with smaller amount of partitions

I have a Spark application that keeps running out of memory. The cluster has two nodes with around 30G of RAM, and the input data size is a few hundred GB.
The application is a Spark SQL job: it reads data from HDFS, creates a table and caches it, then runs some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and got OOM; then I was able to fix the memory issue by using 1024 partitions. But why did using more partitions help me solve the OOM issue?
The solution to big data is partitioning (divide and conquer), since not all the data can fit into memory, and it cannot all be processed on a single machine either.
Each partition can fit into memory and be processed (map) in a relatively short time. After the data of each partition has been processed, it needs to be merged (reduce). This is traditional MapReduce.
Splitting the data into more partitions means that each partition gets smaller.
[Edit]
Spark uses a revolutionary concept called the Resilient Distributed Dataset (RDD).
There are two types of operations: transformations and actions.
Transformations map one RDD to another. They are lazily evaluated. Those RDDs can be treated as intermediate results that we don't want to retrieve.
Actions are used when you really want to get the data. Those RDDs/data can be treated as what we actually want, like "take top failing".
Spark analyses all the operations and creates a DAG (Directed Acyclic Graph) before execution.
Spark starts computing from the source RDD when an action is fired, and then forgets it.
(source: cloudera.com)
I made a small screencast for a presentation on Youtube Spark Makes Big Data Sparking.
"Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data". The issue here is large
partitions generating OOM.
Partitions determine the degree of parallelism. The Apache Spark docs say that the number of partitions should be at least equal to the number of cores in the cluster.
Fewer partitions result in:
Less concurrency
Increased memory pressure for transformations that involve a shuffle
More susceptibility to data skew
Too many partitions can also have a negative impact:
Too much time spent scheduling many tasks
When you store your data on HDFS, it is already partitioned into 64 MB or 128 MB blocks, as per your HDFS configuration. When reading HDFS files with Spark, the number of DataFrame partitions (df.rdd.getNumPartitions) depends on the following properties:
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
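How the three properties above interact can be sketched in plain Python. This mirrors the split-size calculation as I understand it from Spark's file source code, so treat it as an approximation: the function names are made up, and the partition count ignores that Spark may pack several small chunks into one partition, which makes it an upper estimate.

```python
import math

MB = 1024 * 1024

def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB, open_cost_bytes=4 * MB):
    """Approximate split size: each file is padded by the open cost, the
    total is shared across the available cores, and the result is clamped
    between openCostInBytes and maxPartitionBytes."""
    total_bytes = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))

def approx_partitions(file_sizes, default_parallelism):
    """Upper estimate of read partitions: files are cut into chunks of at
    most max_split_bytes each."""
    split = max_split_bytes(file_sizes, default_parallelism)
    return sum(math.ceil(size / split) for size in file_sizes)

one_big_file = [10 * 1024 * MB]  # hypothetical single 10 GiB file

print(approx_partitions(one_big_file, default_parallelism=8))    # 80
print(approx_partitions(one_big_file, default_parallelism=200))  # 200
```

With few cores the 128MB cap decides the count; with many cores the split size shrinks so that there is roughly one chunk per core.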
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During Spark Summit, Aaron Davidson gave some tips about partition tuning. He also defined a reasonable number of partitions, summarized in the 3 points below:
Commonly between 100 and 10000 partitions (note: the two points below are more reliable, because "commonly" here depends on the sizes of the dataset and the cluster)
lower bound = at least 2 * the number of cores in the cluster
upper bound = tasks should still take at least 100 ms to run
Rockie's answer is right, but he doesn't get the point of your question.
When you cache an RDD, all of its partitions are persisted (according to the storage level), respecting the spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, at a certain moment Spark can automatically drop some partitions from memory (or you can do this manually for the entire RDD with RDD.unpersist()), according to the documentation.
Thus, as you have more partitions, Spark stores fewer partitions in the LRU cache so that they do not cause OOM (this may have a negative impact too, like the need to re-cache partitions).
Another important point is that when you write the result back to HDFS using X partitions, you have X tasks for all your data. Take the total data size and divide it by X: this is the memory needed for each task, and each task is executed on one (virtual) core. So it's not difficult to see that X = 64 leads to OOM, but X = 1024 does not.
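The division described above is easy to check with a quick sketch. The question only says "a few hundred GB", so the 300 GiB total below is an assumed figure, and the helper name is made up for the example.

```python
def avg_partition_gib(total_gib, partitions):
    """Average amount of data one task has to handle, in GiB."""
    return total_gib / partitions

TOTAL_GIB = 300  # assumed input size; the question only says "a few hundred GB"

for x in (64, 1024):
    print(f"X = {x}: {avg_partition_gib(TOTAL_GIB, x):.2f} GiB per task")
# X = 64: 4.69 GiB per task
# X = 1024: 0.29 GiB per task
```

With around 30G of RAM per node and several tasks running at once, a few GiB per task adds up quickly, while a few hundred MiB per task leaves headroom.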

Settle the right number of partitions on an RDD

I read some comments which say that a good number of partitions for an RDD is 2-3 times the number of cores. I have 8 nodes, each with two 12-core processors, so I have 192 cores. I set the number of partitions between 384 and 576, but it doesn't seem to work efficiently; I tried 8 partitions, same result. Maybe I have to set other parameters in order for my job to work better on the cluster than on my machine. I should add that the file I analyse has 150k lines.
val data = sc.textFile("/img.csv",384)
The primary effects come from specifying too few partitions or far too many partitions.
Too few partitions: you will not utilize all of the cores available in the cluster.
Too many partitions: there will be excessive overhead in managing many small tasks.
Of the two, the first is far more impactful on performance. Scheduling too many small tasks has a relatively small impact for partition counts below 1000. If you have on the order of tens of thousands of partitions, then Spark gets very slow.
Now, considering your case, you are getting the same results from 8 and from 384-576 partitions. Generally, the rule of thumb says:
NoOfPartitions = (NumberOfWorkerNodes*NoOfCoresPerWorkerNode)-1
The reasoning is that tasks are processed by CPU cores, so we should set the number of partitions to the total number of cores in the cluster, minus 1 (for the Application Master or the driver). That means each core will process one partition at a time.
That means setting 191 partitions may improve the performance. Otherwise, the impact of setting too few or too many partitions is explained at the beginning.
Hope this will help!!!
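For completeness, the rule of thumb from this answer works out as follows for the cluster in the question (8 nodes with two 12-core processors each). The function name is made up for the example; the formula is the one quoted in the answer.

```python
def rule_of_thumb_partitions(worker_nodes, cores_per_worker_node):
    """NoOfPartitions = (NumberOfWorkerNodes * NoOfCoresPerWorkerNode) - 1,
    leaving one core for the Application Master or driver."""
    return worker_nodes * cores_per_worker_node - 1

cores_per_node = 2 * 12  # two 12-core processors per node
print(rule_of_thumb_partitions(8, cores_per_node))  # 191
```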

Resources