We are currently facing wide Partitions for certain customers. We have data in Gbs for those partitions. We tried data modelling and partitions always seems to be skewed. We were trying to use bucketing logic to minimise the partition size. Either large number of partitions are generated for low resource consuming users or Wide partitions are generated for high resource consumers.
Large number of partitions lead to heap memory bloat while wide partitions lead to slower reads.
How can i solve a situation like this ?
First of all I'm working with PySpark on Glue and I'm reading several very large CSV files. Those CSVs are bzip2 compressed and inflated several GB large.
At this stage of processing I'm only performing a simple map over all rows. No joins, group bys, filtering. Just a map.
Let's say I am working on 10 nodes. Generally speaking, would it be preferable to have a rather high number of partitions or a rather low number?
I would guess that independent of available cores on all those nodes that number should be pretty high to make sure that every executor is busy at all times having small chunks of data available.
So, let's say there are 20 cores on those 10 nodes and let's for a second assume there are key-based partitions then something larger than 40 would likely not be a good idea. But in the key-agnostic mapping case I'd tend to something like 1000 partitions or more.
Does that make sense? I'm especially interested in the thought process here.
Although Cassandra allows -2^63 to +2^63-1 number of paritions, is there a recommended max number of partitions beyond which performance might suffer?
After about 1 billion partitions per node full repairs (non incremental) begin to have pretty serious issues with over streaming. Particularly with smaller partitions as the validation compactions run slower.
Ideally i would recommend it by partition size not count. Somewhere around 100mb partitions and you will have more efficient compactions without too much of the expensive overhead of the partition index on reads. I wouldn't be too strict on it though as its very hand wavey on a lot of factors. Try to focus on modeling for your queries first then fine tune it if the said model ends up having too large or too many too small partitions (hundreds of millions or more sub 1k or any multi gb ~ish -- per node not total)
I have an Spark application that keeps running out of memory, the cluster has two nodes with around 30G of RAM, and the input data size is about few hundreds of GBs.
The application is a Spark SQL job, it reads data from HDFS and create a table and cache it, then do some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and I got OOM, then I was able to fix the memory issue by using 1024 partitions. But why using more partitions helped me solve the OOM issue?
The solution to big data is partition(divide and conquer). Since not all data could be fit into the memory, and it also could not be processed in a single machine.
Each partition could fit into memory and processed(map) in relative short time. After the data is processed for each partition. It need be merged (reduce). This is tradition map reduce
Splitting data to more partitions means that each partition getting smaller.
[Edit]
Spark using revolution concept called Resilient Distributed DataSet(RDD).
There are two types of operations, transformation and acton
Transformations are mapping from one RDD to another. It is lazy evaluated. Those RDD could be treated as intermediate result we don't wanna get.
Actions is used when you really want get the data. Those RDD/data could be treated as what we want it, like take top failing.
Spark will analysed all the operation and create a DAG(Directed Acyclic Graph) before execution.
Spark start compute from source RDD when actions are fired. Then forget it.
(source: cloudera.com)
I made a small screencast for a presentation on Youtube Spark Makes Big Data Sparking.
Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data". The issue with large
partitions generating OOM
Partitions determine the degree of parallelism. Apache Spark doc says that, the partitions size should be atleast equal to the number of cores in the cluster.
Less partitions results in
Less concurrency,
Increase memory pressure for transformation which involves shuffle
More susceptible for data skew.
Many partitions might also have negative impact
Too much time spent in scheduling multiple tasks
Storing your data on HDFS, it will be partitioned already in 64 MB or 128 MB blocks as per your HDFS configuration When reading HDFS files with spark, the number of DataFrame partitions df.rdd.getNumPartitions depends on following properties
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During Spark Summit Aaron Davidson gave some tips about partitions tuning. He also defined a reasonable number of partitions resumed to below 3 points:
Commonly between 100 and 10000 partitions (note: two below points are more reliable because the "commonly" depends here on the sizes of dataset and the cluster)
lower bound = at least 2*the number of cores in the cluster
upper bound = task must finish within 100 ms
Rockie's answer is right, but he does't get the point of your question.
When you cache an RDD, all of his partitions are persisted (in term of storage level) - respecting spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, in an certain moment Spark can automatically drop's out some partitions of memory (or you can do this manually for entire RDD with RDD.unpersist()), according with documentation.
Thus, as you have more partitions, Spark is storing fewer partitions in LRU so that they are not causing OOM (this may have negative impact too, like the need to re-cache partitions).
Another importante point is that when you write result back to HDFS using X partitions, then you have X tasks for all your data - take all the data size and divide by X, this is the memory for each task, that are executed on each (virtual) core. So, that's not difficult to see that X = 64 lead to OOM, but X = 1024 not.
I read some comments which says than a good number of partition for a RDD is 2-3 time the number of core. I have 8 nodes each with two 12-cores processor, so i have 192 cores, i setup the partition beetween 384-576 but it doesn't seems works efficiently, i tried 8 partition, same result. Maybe i have to setup other parameters in order to my job works better on the cluster rather than on my machine. I add that the file i analyse make 150k lines.
val data = sc.textFile("/img.csv",384)
The primary effect would be by specifying too few partitions or far too many partitions.
Too few partitions You will not utilize all of the cores available in the cluster.
Too many partitions There will be excessive overhead in managing many small tasks.
Between the two the first one is far more impactful on performance. Scheduling too many smalls tasks is a relatively small impact at this point for partition counts below 1000. If you have on the order of tens of thousands of partitions then spark gets very slow.
Now, considering your case, you are getting the same results from 8 and 384-576 partitions. Generally the thumb rule says,
NoOfPartitions = (NumberOfWorkerNodes*NoOfCoresPerWorkerNode)-1
It says that, as we know, the task is processed by CPU cores. So we should set that many number of partitions which is the total number of cores in the cluster to process-1(for Application Master of driver). That means the each core will process each partition at a time.
That means with 191 partitions can improve the performance. Otherwise impact of setting less and more partitions scenario is explained in beginnning.
Hope this will help!!!