How to create partitions for two dataframes while couple of partitions can be located on the same instance/machine on Spark? - apache-spark

We have two DataFrames: df_A, df_B
Let's say, both has a huge # of rows. And we need to partition them.
How to partition them as couples?
For example, partition number is 5:
df_A partitions: partA_1, partA_2, partA_3, partA_4, partA_5
df_B partitions: partB_1, partB_2, partB_3, partB_4, partB_5
If we have 5 machines:
machine_1: partA_1 and partB_1
machine_2: partA_2 and partB_2
machine_3: partA_3 and partB_3
machine_4: partA_4 and partB_4
machine_5: partA_5 and partB_5
If we have 3 machine:
machine_1: partA_1 and partB_1
machine_2: partA_2 and partB_2
machine_3: partA_3 and partB_3
...(when machines are free up)...
machine_1: partA_4 and partB_4
machine_2: partA_5 and partB_5
Note: If one of DataFrames is small enough, we can use broadcast technique.
What to do(how to partition) when both (or more than two) DataFrames are large enough?

I think we need to take a step back here. Looking at big sizes aspect only, not broadcast.
Spark is a framework that manages things for your App in terms of co-location of dataframe partitions, taking into account resources allocated vs. resources available and the type of Action, and thus if Workers need to acquire partitions for processing.
repartitions are Transformations. When an Action, such as write:
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
occurs then things kick in.
If you have a JOIN, then Spark will work out if re-partitioning and movement is required.
That is to say, if you join on c1 for both DF's, then re-partitioning may most likely well occur for the c1 column, so that occurrences in the DF's for that c1 column are shuffled to the same Nodes where a free Executor resides waiting to serve that JOIN of 2 or more partitions.
That only occurs when an Action is invoked. In this way, if you do unnecessary Transformation, Catalyst can obviate those things.
Also, for number of partitions used, this is a good link imho: spark.sql.shuffle.partitions of 200 default partitions conundrum

Related

What is the difference between spark.shuffle.partition and spark.repartition in spark?

What I understand is
When we repartition any dataframe with value n, data will continue to remain on those n partitions, until you hit any shuffle stages or other value of repartition or coalesce.
For Shuffle, it only comes into the play when you hit any shuffle stages and data will continue to remain on those partitions until you hit coalesce or repartition.
I am right ?
If yes then, can any one point out a striking difference?
TLDR - Repartition is invoked as per developer's need but shuffle is done when there is a logical demand
I assume you're talking about config property spark.sql.shuffle.partitions and method .repartition.
As data distribution is an important aspect in any distributed environment, which not only governs parallelism but can also create adverse impacts if the distribution is uneven. However, repartitioning itself is a costly operation as it involves heavy movement of data (i.e. Shuffling). The .repartition method is used to explicitly repartition the data into new partitions - meaning to increase or decrease the number of partitions in the program based on your need. You can invoke this whenever you want.
As opposed to this, spark.sql.shuffle.partitions is a configuration property that governs the number of partitions created when a data movement happens as a result of operations like aggregations and joins.
Configures the number of partitions to use when shuffling data for
joins or aggregations.
When you're performing transformations other than join or aggregation, the above configuration won't have any impact on the number of partitions the new Dataframe will have.
Your confusion between the two is due to both operations involving shuffling. While that is true, the former (i.e. repartition) is an explicit operation where the user is dictating the framework to increase or decrease the number of partitions - which in turn causes shuffling, while in case of joins/aggregation - the shuffling is caused by the operation itself.
Basically -
Joins/Aggregations cause shuffling which causes repartitioning
repartition is asked thus, shuffling has to be done
Another method coalesce make the difference clearer.
For reference, coalesce is a variant of repartition which can only lower the number of partitions, not necessarily equal in size. As it already knows the number of partitions are only to be decreased, it can perform it with minimal shuffling (just join two adjacent partitions until the number is met).
Consider your dataframe has 4 partitions but has data only in 2 of them, thus you decide to reduce the number of partitions to 2. When using coalesce spark tries to achieve this without shuffling or with minimal shuffling.
df.rdd().getNumPartitions(); // Returns 4 with size 0, 0, 2, 4
df=df.coalesce(2); // Decrease partitions to 2
df.rdd().getNumPartitions(); // Returns 2 now with size 2, 4
So there was no shuffling involved. While the following
df1.rdd().getNumPartitions() // Returns 4
df2.rdd().getNumPartitions() // Returns 8
df1.join(df2).rdd().getNumPartitions() // Returns 200
As you've performed a join it'll always return the number of partitions based on spark.sql.shuffle.partitions

Splitting spark data into partitions and writing those partitions to disk in parallel

Problem outline: Say I have 300+ GB of data being processed with spark on an EMR cluster in AWS. This data has three attributes used to partition on the filesystem for use in Hive: date, hour, and (let's say) anotherAttr. I want to write this data to a fs in such a way that minimizes the number of files written.
What I'm doing right now is getting the distinct combinations of date, hour, anotherAttr, and a count of how many rows make up combination. I collect them into a List on the driver, and iterate over the list, building a new DataFrame for each combination, repartitioning that DataFrame using the number of rows to guestimate file size, and writing the files to disk with DataFrameWriter, .orc finishing it off.
We aren't using Parquet for organizational reasons.
This method works reasonably well, and solves the problem that downstream teams using Hive instead of Spark don't see performance issues resulting from a high number of files. For example, if I take the whole 300 GB DataFrame, do a repartition with 1000 partitions (in spark) and the relevant columns, and dumped it to disk, it all dumps in parallel, and finishes in ~9 min with the whole thing. But that gets up to 1000 files for the larger partitions, and that destroys Hive performance. Or it destroys some kind of performance, honestly not 100% sure what. I've just been asked to keep the file count as low as possible. With the method I'm using, I can keep the files to whatever size I want (relatively close anyway), but there is no parallelism and it takes ~45 min to run, mostly waiting on file writes.
It seems to me that since there's a 1-to-1 relationship between some source row and some destination row, and that since I can organize the data into non-overlapping "folders" (partitions for Hive), I should be able to organize my code/DataFrames in such a way that I can ask spark to write all the destination files in parallel. Does anyone have suggestions for how to attack this?
Things I've tested that did not work:
Using a scala parallel collection to kick off the writes. Whatever spark was doing with the DataFrames, it didn't separate out the tasks very well and some machines were getting massive garbage collection problems.
DataFrame.map - I tried to map across a DataFrame of the unique combinations, and kickoff writes from inside there, but there's no access to the DataFrame of the data that I actually need from within that map - the DataFrame reference is null on the executor.
DataFrame.mapPartitions - a non-starter, couldn't come up with any ideas for doing what I want from inside mapPartitions
The word 'partition' is also not especially helpful here because it refers both to the concept of spark splitting up the data by some criteria, and to the way that the data will be organized on disk for Hive. I think I was pretty clear in the usages above. So if I'm imagining a perfect solution to this problem, it's that I can create one DataFrame that has 1000 partitions based on the three attributes for fast querying, then from that create another collection of DataFrames, each one having exactly one unique combination of those attributes, repartitioned (in spark, but for Hive) with the number of partitions appropriate to the size of the data it contains. Most of the DataFrames will have 1 partition, a few will have up to 10. The files should be ~3 GB, and our EMR cluster has more RAM than that for each executor, so we shouldn't see a performance hit from these "large" partitions.
Once that list of DataFrames is created and each one is repartitioned, I could ask spark to write them all to disk in parallel.
Is something like this possible in spark?
One thing I'm conceptually unclear on: say I have
val x = spark.sql("select * from source")
and
val y = x.where(s"date=$date and hour=$hour and anotherAttr=$anotherAttr")
and
val z = x.where(s"date=$date and hour=$hour and anotherAttr=$anotherAttr2")
To what extent is y is a different DataFrame than z? If I repartition y, what effect does the shuffle have on z, and on x for that matter?
We had the same problem (almost) and we ended up by working directly with RDD (instead of DataFrames) and implementing our own partitioning mechanism (by extending org.apache.spark.Partitioner)
Details: we are reading JSON messages from Kafka. The JSON should be grouped by customerid/date/more fields and written in Hadoop using Parquet format, without creating too many small files.
The steps are (simplified version):
a)Read the messages from Kafka and transform them to a structure of RDD[(GroupBy, Message)]. GroupBy is a case class containing all the fields that are used for grouping.
b)Use a reduceByKeyLocally transformation and obtain a map of metrics (no of messages/messages size/etc) for each group - eg Map[GroupBy, GroupByMetrics]
c)Create a GroupPartitioner that's using the previously collected metrics (and some input parameters like the desired Parquet size etc) to compute how many partitions should be created for each GroupBy object. Basically we are extending org.apache.spark.Partitioner and overriding numPartitions and getPartition(key: Any)
d)we partition the RDD from a) using the previously defined partitioner: newPartitionedRdd = rdd.partitionBy(ourCustomGroupByPartitioner)
e)Invoke spark.sparkContext.runJob with two parameters: the first one is the RDD partitioned at d), the second one is a custom function (func: (TaskContext, Iterator[T]) that will write the messages taken from Iterator[T] into Hadoop/Parquet
Let's say that we have 100 mil messages, grouped like that
Group1 - 2 mil
Group2 - 80 mil
Group3 - 18 mil
and we decided that we have to use 1.5 mil messages per partition to obtain Parquet files greater than 500MB. We'll end up with 2 partitions for Group1, 54 for Group2, 12 for Group3.
This statement:
I collect them into a List on the driver, and iterate over the list,
building a new DataFrame for each combination, repartitioning that
DataFrame using the number of rows to guestimate file size, and
writing the files to disk with DataFrameWriter, .orc finishing it off.
is completely off-beam where Spark is concerned. Collecting to driver is never a good approach, volumes and OOM issues and latency in your approach is high.
Use so the below so as to simplify and get parallelism of Spark benefits saving time and money for your boss:
df.repartition(cols...)...write.partitionBy(cols...)...
shuffle occurs via repartition, no shuffling ever with partitionBy.
That simple, with Spark's default parallelism utilized.

Is it possible to coalesce Spark partitions "evenly"?

Suppose we have a PySpark dataframe with data spread evenly across 2048 partitions, and we want to coalesce to 32 partitions to write the data back to HDFS. Using coalesce is nice for this because it does not require an expensive shuffle.
But one of the downsides of coalesce is that it typically results in an uneven distribution of data across the new partitions. I assume that this is because the original partition IDs are hashed to the new partition ID space, and the number of collisions is random.
However, in principle it should be possible to coalesce evenly, so that the first 64 partitions from the original dataframe are sent to the first partition of the new dataframe, the next 64 are send to the second partition, and so end, resulting in an even distribution of partitions. The resulting dataframe would often be more suitable for further computations.
Is this possible, while preventing a shuffle?
I can force the relationship I would like between initial and final partitions using a trick like in this question, but Spark doesn't know that everything from each original partition is going to a particular new partition. Thus it can't optimize away the shuffle, and it runs much slower than coalesce.
In your case you can safely coalesce the 2048 partitions into 32 and assume that Spark is going to evenly assign the upstream partitions to the coalesced ones (64 for each in your case).
Here is an extract from the Scaladoc of RDD#coalesce:
This results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
Consider that also how your partitions are physically spread across the cluster influence the way in which coalescing happens. The following is an extract from CoalescedRDD's ScalaDoc:
If there is no locality information (no preferredLocations) in the parent, then the coalescing is very simple: chunk parents that are close in the Array in chunks.
If there is locality information, it proceeds to pack them with the following four goals:
(1) Balance the groups so they roughly have the same number of parent partitions
(2) Achieve locality per partition, i.e. find one machine which most parent partitions prefer
(3) Be efficient, i.e. O(n) algorithm for n parent partitions (problem is likely NP-hard)
(4) Balance preferred machines, i.e. avoid as much as possible picking the same preferred machine

Understanding Shuffle and rePartitioning in spark

I would greatly appreciate if someone could answer these few spark shuffle related questions in simplified terms .
In spark, when loading a data-set ,we specify the number of partitions, which tells how many block the input data(RDD) should be divided into ,and based on the number of partitions, equal number of tasks are launched (correct me, if the assumption is wrong).for X number of cores in worker node.corresponding X number of task run at one time.
Along similar lines ,here are the few questions.
Since,All byKey operations along with coalesce, repartition,join and cogroup, causes data shuffle.
Is data shuffle another name for repartitiong operation?
What happens to the initial partitions(number of partitions declared)when repartitions happens.
Can someone give example(explain) how data movement across the cluster happens.i have seen couple of examples where
random arrow movement of keys is shown (but dont know how the movement is being driven),for example if we have already have data in 10 partitions,does the re partitioning operation combine all data first ,and then send the particular key to the particular partition based on the hash-code%numberofpartitions.
First of all, HDFS blocks is divided into number of partition not in the blocks. These petitions resides in the work of memory. These partitions resides in the worker memory.
Q- Is data shuffle another name for repartitiong operation?
A- No. Generally repartition means increasing the existing partition in which the data is divided into into. So whenever we increase the partition, we are actually trying to “move” the data in number of new partitions set in code not “Shuffling” . Shuffling is somewhat when we move the data of particular key in one partition.
Q- What happens to the initial partitions(number of partitions declared)when repartitions happens?
A- Covered above
One more underlying thing is rdd.repartition(n) will not do change the no. Of partitions of rdd, its a tranformation, which will work when some other rdd is created like
rdd1=rdd.repartition(n)
Now it will create new rdd1 that have n number of partition.To do this, we can call coalesce function like rdd.coalesce(n) Being an action function, this will change the partitions of rdd itself.
Q- Can someone give example(explain) how data movement across across the cluster happens.i have seen couple of examples where random arrow movement of keys is shown (but dont know how the movement is being driven),for example if we have already have data in 10 partitions,does the re partitioning operation combine all data first ,and then send the particular key to the particular partition based on the hash-code%numberofpartitions.
Ans- partition and partitioning at two different different concept so partition is something in which the data is divided evenly in the number of partitions set by the user but in partitioning, data is shuffled among those partitions according to algorithms set by user like HashPartitioning & RangePartitioning.
Like rdd= sc.textFile(“../path”,5) rdd.partitions.size/length
O/p: Int: 5(No.of partitions)
rdd.partitioner.isDefined
O/p: Boolean= false
rdd.partitioner
O/p: None(partitioning scheme)
But,
rdd=sc.textFile(“../path”,5).partitionBy(new org.apache.spark.HashPartition(10).cache()
rdd.partitions.size
O/p: Int: 10
rdd.partitioner.isDefined
O/p: Boolean: true
rdd.partitioner
O/p: HashPartitioning#
Hope this will help!!!

Managing Spark partitions after DataFrame unions

I have a Spark application that will need to make heavy use of unions whereby I'll be unioning lots of DataFrames together at different times, under different circumstances. I'm trying to make this run as efficiently as I can. I'm still pretty much brand-spanking-new to Spark, and something occurred to me:
If I have DataFrame 'A' (dfA) that has X number of partitions (numAPartitions), and I union that to DataFrame 'B' (dfB) which has Y number of partitions (numBPartitions), then what will the resultant unioned DataFrame (unionedDF) look like, with result to partitions?
// How many partitions will unionedDF have?
// X * Y ?
// Something else?
val unionedDF : DataFrame = dfA.unionAll(dfB)
To me, this seems like its very important to understand, seeing that Spark performance seems to rely heavily on the partitioning strategy employed by DataFrames. So if I'm unioning DataFrames left and right, I need to make sure I'm constantly managing the partitions of the resultant unioned DataFrames.
The only thing I can think of (so as to properly manage partitions of unioned DataFrames) would be to repartition them and then subsequently persist the DataFrames to memory/disk as soon as I union them:
val unionedDF : DataFrame = dfA.unionAll(dfB)
unionedDF.repartition(optimalNumberOfPartitions).persist(StorageLevel.MEMORY_AND_DISK)
This way, as soon as they are unioned, we repartition them so as to spread them over the available workers/executors properly, and then the persist(...) call tells to Spark to not evict the DataFrame from memory, so we can continue working on it.
The problem is, repartitioning sounds expensive, but it may not be as expensive as the alternative (not managing partitions at all). Are there generally-accepted guidelines about how to efficiently manage unions in Spark-land?
Yes, Partitions are important for spark.
I am wondering if you could find that out yourself by calling:
yourResultedRDD.getNumPartitions()
Do I have to persist, post union?
In general, you have to persist/cache an RDD (no matter if it is the result of a union, or a potato :) ), if you are going to use it multiple times. Doing so will prevent spark from fetching it again in memory and can increase the performance of your application by 15%, in some cases!
For example if you are going to use the resulted RDD just once, it would be safe not to do persist it.
Do I have to repartition?
Since you don't care about finding the number of partitions, you can read in my memoryOverhead issue in Spark
about how the number of partitions affects your application.
In general, the more partitions you have, the smaller the chunk of data every executor will process.
Recall that a worker can host multiple executors, you can think of it like the worker to be the machine/node of your cluster and the executor to be a process (executing in a core) that runs on that worker.
Isn't the Dataframe always in memory?
Not really. And that's something really lovely with spark, since when you handle bigdata you don't want unnecessary things to lie in the memory, since this will threaten the safety of your application.
A DataFrame can be stored in temporary files that spark creates for you, and is loaded in the memory of your application only when needed.
For more read: Should I always cache my RDD's and DataFrames?
Union just add up the number of partitions in dataframe 1 and dataframe 2. Both dataframe have same number of columns and same order to perform union operation. So no worries, if partition columns different in both the dataframes, there will be max m + n partitions.
You doesn't need to repartition your dataframe after join, my suggestion is to use coalesce in place of repartition, coalesce combine common partitions or merge some small partitions and avoid/reduce shuffling data within partitions.
If you cache/persist dataframe after each union, you will reduce performance and lineage is not break by cache/persist, in that case, garbage collection will clean cache/memory in case of some heavy memory intensive operation and recomputing will increase computation time for the same, may be this time partial computation is required for clear/removed data.
As spark transformation are lazy, i.e; unionAll is lazy operation and coalesce/repartition is also lazy operation and come in action at the time of first action, so try to coalesce unionall result after an interval like counter of 8 and reduce partition in resulting dataframe. Use checkpoints to break lineage and store data, if there is lots of memory intensive operation in your solution.

Resources