Spark Repartition and Coalesce - apache-spark

If I want to repartition a dataframe, How to decide on the number of partitions that need to be made? How to decide on whether to use repartition or coalesce?
I understand that coalesce is basically used only to reduce the number of partitions. But how can we decide which to use in what scenario?

we can't decide this based on specific parameter there will be multiple factors are there to decide how many partitions and repartition or coalesce
*based on the size of data ,
if size of the file is too big you can give 2 or 3 partitions per block to
increase the performance but if give more too many partitions it split as
small files .In Big data small files will lower performance .
1 Block (128 MB) --> 128/2 = 64MB each partition ,So 1 mapper will run for 64 MB
*based on the cluster size , if you have more number of executors/cores are free you can give according to that.
*repartition will cause the complete shuffling and coalesce will avoid the complete shuffle.

Related

Is it possible to coalesce Spark partitions "evenly"?

Suppose we have a PySpark dataframe with data spread evenly across 2048 partitions, and we want to coalesce to 32 partitions to write the data back to HDFS. Using coalesce is nice for this because it does not require an expensive shuffle.
But one of the downsides of coalesce is that it typically results in an uneven distribution of data across the new partitions. I assume that this is because the original partition IDs are hashed to the new partition ID space, and the number of collisions is random.
However, in principle it should be possible to coalesce evenly, so that the first 64 partitions from the original dataframe are sent to the first partition of the new dataframe, the next 64 are send to the second partition, and so end, resulting in an even distribution of partitions. The resulting dataframe would often be more suitable for further computations.
Is this possible, while preventing a shuffle?
I can force the relationship I would like between initial and final partitions using a trick like in this question, but Spark doesn't know that everything from each original partition is going to a particular new partition. Thus it can't optimize away the shuffle, and it runs much slower than coalesce.
In your case you can safely coalesce the 2048 partitions into 32 and assume that Spark is going to evenly assign the upstream partitions to the coalesced ones (64 for each in your case).
Here is an extract from the Scaladoc of RDD#coalesce:
This results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
Consider that also how your partitions are physically spread across the cluster influence the way in which coalescing happens. The following is an extract from CoalescedRDD's ScalaDoc:
If there is no locality information (no preferredLocations) in the parent, then the coalescing is very simple: chunk parents that are close in the Array in chunks.
If there is locality information, it proceeds to pack them with the following four goals:
(1) Balance the groups so they roughly have the same number of parent partitions
(2) Achieve locality per partition, i.e. find one machine which most parent partitions prefer
(3) Be efficient, i.e. O(n) algorithm for n parent partitions (problem is likely NP-hard)
(4) Balance preferred machines, i.e. avoid as much as possible picking the same preferred machine

When should I repartition an RDD?

I know that I can repartition an RDD to increase its partitions and use coalesce to decrease its partitions. I have two questions regarding this that I cannot completely understand after reading different resources.
Spark will use a sensible default (1 partition per block which is 64MB in first versions and now 128MB) when generating an RDD. But I also read that it is recommended to use 2 or 3 times the number of cores running the jobs. So here comes the question:
How many partitions should I use for a given file? For example, suppose I have a 10GB .parquet file, 3 executors with 2 cores and 3gb memory each.
Should I repartition? How many partitions should I use? What is the better way to make that choice?
Are all data types (ie .txt, .parquet, etc..) repartitioned by default if no partitioning is provided?
Spark can run a single concurrent task for every partition of an RDD, up to the total number of cores in the cluster.
For example :
val rdd= sc.textFile ("file.txt", 5)
The above line of code will create an RDD named textFile with 5 partitions.
Suppose that you have a cluster with 4 cores and assume that each partition needs to process for 5 minutes. In case of the above RDD with 5 partitions, 4 partition processes will run in parallel as there are 4 cores and the 5th partition process will process after 5 minutes when one of the 4 cores, is free.
The entire processing will be completed in 10 minutes and during the 5th partition process, the resources (remaining 3 cores) will remain idle.
The best way to decide on the number of partitions in a RDD is to make the number of partitions equal to the number of cores in the cluster so that all the
partitions will process in parallel and the resources will be utilized in an optimal way.
Question : Are all data types (ie .txt, .parquet, etc..) repartitioned
by default if no partitioning is provided?
There will be default no of partitions for every rdd.
to check you can use rdd.partitions.length right after rdd created.
to use existing cluster resources in optimal way and to speed up, we have to consider re-partitioning to ensure that all cores are utilized and all partitions have enough number of records which are uniformly distributed.
For better understanding, also have a look at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html
Note : There is no fixed formula for this. general convention most of them follow is
(numOf executors * no of cores) * replicationfactor (which may be 2 or 3 times more )

Apache Spark running out of memory with smaller amount of partitions

I have an Spark application that keeps running out of memory, the cluster has two nodes with around 30G of RAM, and the input data size is about few hundreds of GBs.
The application is a Spark SQL job, it reads data from HDFS and create a table and cache it, then do some Spark SQL queries and writes the result back to HDFS.
Initially I split the data into 64 partitions and I got OOM, then I was able to fix the memory issue by using 1024 partitions. But why using more partitions helped me solve the OOM issue?
The solution to big data is partition(divide and conquer). Since not all data could be fit into the memory, and it also could not be processed in a single machine.
Each partition could fit into memory and processed(map) in relative short time. After the data is processed for each partition. It need be merged (reduce). This is tradition map reduce
Splitting data to more partitions means that each partition getting smaller.
[Edit]
Spark using revolution concept called Resilient Distributed DataSet(RDD).
There are two types of operations, transformation and acton
Transformations are mapping from one RDD to another. It is lazy evaluated. Those RDD could be treated as intermediate result we don't wanna get.
Actions is used when you really want get the data. Those RDD/data could be treated as what we want it, like take top failing.
Spark will analysed all the operation and create a DAG(Directed Acyclic Graph) before execution.
Spark start compute from source RDD when actions are fired. Then forget it.
(source: cloudera.com)
I made a small screencast for a presentation on Youtube Spark Makes Big Data Sparking.
Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data". The issue with large
partitions generating OOM
Partitions determine the degree of parallelism. Apache Spark doc says that, the partitions size should be atleast equal to the number of cores in the cluster.
Less partitions results in
Less concurrency,
Increase memory pressure for transformation which involves shuffle
More susceptible for data skew.
Many partitions might also have negative impact
Too much time spent in scheduling multiple tasks
Storing your data on HDFS, it will be partitioned already in 64 MB or 128 MB blocks as per your HDFS configuration When reading HDFS files with spark, the number of DataFrame partitions df.rdd.getNumPartitions depends on following properties
spark.default.parallelism (Cores available for the application)
spark.sql.files.maxPartitionBytes (default 128MB)
spark.sql.files.openCostInBytes (default 4MB)
Links :
https://spark.apache.org/docs/latest/tuning.html
https://databricks.com/session/a-deeper-understanding-of-spark-internals
https://spark.apache.org/faq.html
During Spark Summit Aaron Davidson gave some tips about partitions tuning. He also defined a reasonable number of partitions resumed to below 3 points:
Commonly between 100 and 10000 partitions (note: two below points are more reliable because the "commonly" depends here on the sizes of dataset and the cluster)
lower bound = at least 2*the number of cores in the cluster
upper bound = task must finish within 100 ms
Rockie's answer is right, but he does't get the point of your question.
When you cache an RDD, all of his partitions are persisted (in term of storage level) - respecting spark.memory.fraction and spark.memory.storageFraction properties.
Besides that, in an certain moment Spark can automatically drop's out some partitions of memory (or you can do this manually for entire RDD with RDD.unpersist()), according with documentation.
Thus, as you have more partitions, Spark is storing fewer partitions in LRU so that they are not causing OOM (this may have negative impact too, like the need to re-cache partitions).
Another importante point is that when you write result back to HDFS using X partitions, then you have X tasks for all your data - take all the data size and divide by X, this is the memory for each task, that are executed on each (virtual) core. So, that's not difficult to see that X = 64 lead to OOM, but X = 1024 not.

How partitions are created in spark RDD

Let's say I am reading a file from HDFS using spark(scala). A HDFS block size is 64 MB.
Assume , the size of HDFS file is 130 MB.
I would like to know how many partitions are created in base RDD
scala> val distFile = sc.textFile("hdfs://user/cloudera/data.txt")
Is it true that no. of partitions are decided based on block size?
In the above case the no. of partitions is 3?
Here is a good article that describes the partition computation logic for input.
The HDFS block size is the maximum size of a partition. So in your example the minimum number of partitions will be 3.
partitions = ceiling(input size/block size)
You can further increase the number of partitions by passing that as a parameter to sc.textFile as in sc.textFile(inputPath,numPartitions)
Also another setting mapreduce.input.fileinputformat.split.minsize plays a role. You can set it to increase the size of partitions (and reduce the number of partitions). So if you set mapreduce.input.fileinputformat.split.minsize to say 130MB then you will only get 1 partition.
you can run and check number of partitions
distFile.partitions.size

Maximum size of rows in Spark jobs using Avro/Parquet

I am planning to use Spark to process data where each individual element/row in the RDD or DataFrame may occasionally be large (up to several GB).
The data will probably be stored in Avro files in HDFS.
Obviously, each executor must have enough RAM to hold one of these "fat rows" in memory, and some to spare.
But are there other limitations on row size for Spark/HDFS or for the common serialisation formats (Avro, Parquet, Sequence File...)? For example, can individual entries/rows in these formats be much larger than the HDFS block size?
I am aware of published limitations for HBase and Cassandra, but not Spark...
There are currently some fundamental limitations related to block size, both for partitions in use and for shuffle blocks - both are limited to 2GB, which is the maximum size of a ByteBuffer (because it takes an int index, so is limited to Integer.MAX_VALUE bytes).
The maximum size of an individual row will normally need to be much smaller than the maximum block size, because each partition will normally contain many rows, and the largest rows might not be evenly distributed among partitions - if by chance a partition contains an unusually large number of big rows, this may push it over the 2GB limit, crashing the job.
See:
Why does Spark RDD partition has 2GB limit for HDFS?
Related Jira tickets for these Spark issues:
https://issues.apache.org/jira/browse/SPARK-1476
https://issues.apache.org/jira/browse/SPARK-5928
https://issues.apache.org/jira/browse/SPARK-6235

Resources