Spark RDD groupByKey + join vs join performance - apache-spark

I am using Spark on a cluster that I share with other users, so running time alone is not a reliable way to tell which version of my code is more efficient: while I am running the more efficient version, someone else may be running a huge job that makes my code take longer.
So can I ask 2 questions here:
I was using the join function to join 2 RDDs, and I tried applying groupByKey() before the join, like this:
rdd1.groupByKey().join(rdd2)
It seems to take longer, but I remember that when I was using Hadoop Hive, GROUP BY made my query run faster. Since Spark uses lazy evaluation, I am wondering whether groupByKey before join makes things faster.
I have noticed that Spark has a SQL module. So far I haven't had time to try it, but can I ask what the differences are between the SQL module and the SQL-like RDD functions?

There is no good reason for groupByKey followed by join to be faster than join alone. If rdd1 and rdd2 have no partitioner, or their partitioners differ, then the limiting factor is simply the shuffle required for HashPartitioning.
By using groupByKey you not only increase the total cost by keeping the mutable buffers required for grouping but, more importantly, you use an additional transformation which results in a more complex DAG. groupByKey + join:
rdd1 = sc.parallelize([("a", 1), ("a", 3), ("b", 2)])
rdd2 = sc.parallelize([("a", 5), ("c", 6), ("b", 7)])
rdd1.groupByKey().join(rdd2)
vs. join alone:
rdd1.join(rdd2)
Finally, these two plans are not even equivalent; to get the same results you have to add an additional flatMap to the first one.
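For illustration, a minimal PySpark sketch of that extra flatMap, reusing the rdd1 and rdd2 defined above (the variable name is made up):
# Flatten (key, (grouped_rdd1_values, rdd2_value)) back into (key, (rdd1_value, rdd2_value))
# so the output matches a plain rdd1.join(rdd2).
equivalent = (rdd1.groupByKey()
                  .join(rdd2)
                  .flatMap(lambda kv: [(kv[0], (v, kv[1][1])) for v in kv[1][0]]))
# sorted(equivalent.collect()) == sorted(rdd1.join(rdd2).collect())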
This is quite a broad question, but to highlight the main differences:
PairwiseRDDs are homogeneous collections of arbitrary Tuple2 elements. For the default operations you want the key to be hashable in a meaningful way; otherwise there are no strict requirements regarding the type. In contrast, DataFrames exhibit much more dynamic typing, but each column can only contain values from a supported set of defined types. It is possible to define a UDT, but it still has to be expressed using the basic ones.
DataFrames use the Catalyst optimizer, which generates logical and physical execution plans and can produce highly optimized queries without the need for manual low-level optimizations. RDD-based operations simply follow the dependency DAG. That means worse performance without custom optimization, but much better control over execution and some potential for fine-grained tuning.
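For comparison, a minimal PySpark sketch (the column names are made up) of the same join expressed through the DataFrame API, where Catalyst plans the query and explain() shows the chosen physical plan:
df1 = spark.createDataFrame(rdd1, ["k", "v1"])
df2 = spark.createDataFrame(rdd2, ["k", "v2"])
df1.join(df2, "k").explain()  # prints the physical plan produced by Catalyst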
Some other things to read:
Difference between DataFrame and RDD in Spark
Why spark.ml don't implement any of spark.mllib algorithms?

I mostly agree with zero323's answer, but I think there is reason to expect join to be faster after groupByKey. groupByKey reduces the amount of data and partitions the data by the key. Both of these help with the performance of a subsequent join.
I don't think the former (reduced data size) is significant. And to reap the benefits of the latter (partitioning) you need to have the other RDD partitioned the same way.
For example:
// groupByKey assigns a HashPartitioner to "a"
val a = sc.parallelize((1 to 10).map(_ -> 100)).groupByKey()
// partition "b" with the same partitioner so the join needs no further shuffle
val b = sc.parallelize((1 to 10).map(_ -> 100)).partitionBy(a.partitioner.get)
a.join(b).collect

Related

Spark SQL joining multiple tables design

I am developing a Spark SQL analytics solution using a set of tables. Suppose there are 5 tables I need to build my solution from, and finally I am creating one output table.
Here is my flow
dataframe1 = table1 join table2
dataframe2 = dataframe1 join table3
dataframe3 = dataframe2 + filter + agg
dataframe4 = dataframe3 join table4 join table5
// finally
dataframe4.saveAsTable
When I save the final dataframe, that's when all the dataframes above are evaluated.
Is my approach good, or do I need to cache/persist the intermediate dataframes?
This is a very generic question and it is hard to provide a definitive answer.
Depending on the size of the tables, you would want to add a broadcast hint for any table that is relatively small.
You can do this via
table_i.join(broadcast(table_j), ....)
The automatic broadcast behaviour depends on the value of spark.sql.autoBroadcastJoinThreshold.
Note that a broadcast hint will be honoured only if Spark is able to evaluate the size of the table, so you might need to cache() it.
Another option is Spark checkpointing, which can help truncate the query plan for optimisation (it also allows you to resume jobs from the checkpoint location; it is similar to writing to HDFS but with some overhead).
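A minimal sketch of the checkpoint approach, assuming the dataframe3 from the flow above (the checkpoint directory is illustrative):
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  # illustrative location
dataframe3 = dataframe3.checkpoint()  # materialises the result and truncates the plan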
If you are broadcasting tables of a few hundred MB, you might need to increase your Kryo buffer:
--conf spark.kryoserializer.buffer.max=1g
It also depends on which join types you use.
You would probably want to do filtering and aggregation as early as possible, since that reduces the join surface.
There are many other considerations to take into account in order to optimise this properly. If the join keys in any of the joins follow a power-law distribution, you would need to salt the keys and explode the smaller table.
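For reference, a minimal PySpark sketch of salting a skewed join; big_df, small_df, the join column "key" and the number of salt buckets are all hypothetical:
from pyspark.sql import functions as F

SALT_BUCKETS = 16  # hypothetical; tune to the observed skew

# Add a random salt to the skewed (large) side of the join.
big_salted = big_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Explode the small side so every salt value has a matching row.
small_salted = small_df.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

result = big_salted.join(small_salted, ["key", "salt"]).drop("salt")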
In your case, in principle, there is not really a cache or persist required. Why?
Because there are no evident reuse paths (for other Actions, or other Transformations within the same Action); it is all sequential.
Also, lazy evaluation and Catalyst.
Try .explain and see how Spark will process it.
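For example, using the dataframe4 from the flow above:
dataframe4.explain(True)  # prints the parsed, analysed, optimised and physical plans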
However, due to possible memory eviction on the cluster, there may be a need to re-compute on a worker. There are various settings you could apply via .cache and .persist, but Spark handles memory and disk spills without explicit .cache or .persist. See https://sparkbyexamples.com/spark/spark-difference-between-cache-and-persist/
Also, using .cache can affect performance, so use .explain. See an excellent post here: Spark: Explicit caching can interfere with Catalyst optimizer's ability to optimize some queries?
So, each case is different, but yours seems OK to answer as I have. In summary: an RDD or DataFrame that is neither cached nor checkpointed is re-evaluated each time an Action is invoked on it, or if it is re-accessed within the current Action and no skipped-stage situation applies. In your case there is no issue; doing otherwise would in fact slow your app down.

How to choose between join(broadcast) and collect with Spark

I'm using Spark 2.2.1.
I have a small DataFrame (less than 1M) and a computation on a big DataFrame that will need this small one to compute a column in a UDF.
What is the best option regarding performance?
Is it better to broadcast this DF (I don't know if Spark will do the Cartesian product in memory):
bigDF.crossJoin(broadcast(smallDF))
  .withColumn("newCol", udf($"colFromSmall", $"colFromBig"))  // "newCol" is a placeholder output column name
or to collect it and use the small value directly in the udf
val small = smallDF.collect()
bigDF.withColumn("newCol", udf($"colFromBig"))
Both will collect data first, so in terms of memory footprint there is no difference. So the choice should be dictated by the logic:
If you can do better than the default execution plan and don't want to create your own, a udf might be a better approach.
If it is just a Cartesian and requires a subsequent explode - perish the thought - just go with the former option.
As suggested in the comments by T.Gawęda, in the second case you can use a broadcast variable:
val small = spark.sparkContext.broadcast(smallDF.collect())
bigDF.withColumn("newCol", udf($"colFromBig"))
It might provide some performance improvement if the udf is reused.
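For illustration only, the same broadcast-plus-UDF pattern in PySpark, with hypothetical column names ("k", "v", "colFromBig", "newCol"); the broadcast dictionary is looked up inside the UDF on the executors:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Broadcast a plain dict built from the collected small DataFrame.
lookup = spark.sparkContext.broadcast(
    {row["k"]: row["v"] for row in smallDF.collect()}
)

@F.udf(returnType=StringType())
def resolve(col_from_big):
    return lookup.value.get(col_from_big)

bigDF.withColumn("newCol", resolve(F.col("colFromBig")))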

How to reduce Spark task counts & avoid group by

All, I am using PySpark and need to join two RDDs, but to join them I first have to group all elements of each RDD by the joining key and only then perform the join. This causes additional overhead and I am not sure what a workaround could be. It is also creating a high number of tasks, which in turn increases the number of files to write to HDFS and slows the overall process down a lot. Here is an example:
RDD1 = [join_col, {All_Elements of RDD1}]  # derived by using groupBy on join_col
RDD2 = [join_col, {All_Elements of RDD2}]  # derived by using groupBy on join_col
RDD3 = RDD1.join(RDD2)
If the desired output is grouped and both RDDs are too large to be broadcast, there is not much you can do at the code level. It could be cleaner to simply apply cogroup:
rdd1.cogroup(rdd2)
but there should be no significant difference performance-wise. If you suspect there is a large data / hash skew you can try a different partitioning, for example by using sortByKey, but it is unlikely to help in the general case.
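For reference, a minimal PySpark sketch of the cogroup approach, where rdd1 and rdd2 stand for the original, ungrouped pair RDDs:
# cogroup yields (key, (iterable_of_rdd1_values, iterable_of_rdd2_values)) directly,
# without materialising two separate groupByKey results first.
grouped = rdd1.cogroup(rdd2).mapValues(lambda vs: (list(vs[0]), list(vs[1])))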

'map-side' aggregation in Spark

I am learning Spark using the book 'Learning Spark' and came across this passage (page 54):
"We can disable map-side aggregation in combineByKey() if we know that our data won't benefit from it." I am confused about what is meant by map-side aggregation here. The only thing that comes to my mind is Mapper & Reducer in Hadoop MapReduce... but I believe that is in no way related to Spark.
The idea behind map-side aggregation is pretty much the same as Hadoop combiners. If a single mapper can yield multiple values for the same key, you can reduce shuffling by reducing the values locally.
One example of an operation which can benefit from map-side aggregation is creating a set of values for each key, especially when you partition an RDD before combining.
First let's create some dummy data:
val pairs = sc.parallelize(
  ("foo", 1) :: ("foo", 1) :: ("foo", 2) ::
  ("bar", 3) :: ("bar", 4) :: ("bar", 5) :: Nil
)
// Pre-partition by key; with the same partitioner, combineByKey can merge values locally
val partitionedPairs = pairs.partitionBy(new org.apache.spark.HashPartitioner(2))
And merge data using combineByKey:
import collection.mutable.{Set => MSet}
val combined = partitionedPairs.combineByKey(
(v: Int) => MSet[Int](v),
(set: MSet[Int], v: Int) => set += v,
(set1: MSet[Int], set2: MSet[Int]) => set1 ++= set2
)
Depending on the data distribution this can significantly reduce network traffic. Overall
reduceByKey,
combineByKey with mapSideCombine set to true
aggregateByKey
foldByKey
will use map side aggregations, while groupByKey and combineByKey with mapSideCombine set to false won't.
The choice, however, between applying map-side aggregation or not is not always obvious. The cost of maintaining the required data structures and the subsequent garbage collection can in many cases exceed the cost of the shuffle.
You're right, the term map-side reduce does come from the MapReduce world, and the idea is a bit complicated on the Apache Spark side of things. If we can combine multiple elements within a partition before shuffling them (and the combined elements take up less space), then performing a per-partition reduction prior to shuffling the data is useful.
One case where map-side reduction is disabled in Spark is groupByKey: even if we could combine some of the elements in the same partition, they would take up about the same amount of space anyway, so there is no corresponding reduction in network/serialization work.
Hope that helps and glad you are reading Learning Spark :)
For example, you cannot use map-side aggregation (a combiner) if you group values by key (the groupByKey operation does not use a combiner). The reason is that all values for each key must be present after the groupByKey operation has finished, so local reduction of the values (a combiner) is not possible.
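A small PySpark contrast of the two approaches (the data is made up for illustration): reduceByKey can pre-aggregate within each partition, while groupByKey has to ship every value across the network before anything is combined:
words = sc.parallelize(["foo", "foo", "bar", "foo", "bar"]).map(lambda w: (w, 1))

# Combiner-friendly: partial sums are computed per partition before the shuffle.
counts = words.reduceByKey(lambda a, b: a + b)

# No combiner: every (word, 1) pair is shuffled, then grouped and summed.
counts_via_group = words.groupByKey().mapValues(sum)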

Operations and methods to be careful about in Apache Spark?

What operations and/or methods do I need to be careful about in Apache Spark? I've heard you should be careful about:
groupByKey
collectAsMap
Why?
Are there other methods?
There are what you could call 'expensive' operations in Spark: all those that require a shuffle (data reorganization) fall into this category. Checking for the presence of ShuffledRDD in the result of rdd.toDebugString gives those away.
If you mean "careful" as "with the potential of causing problems", some operations in Spark will cause memory-related issues when used without care:
groupByKey requires that all values falling under one key fit in memory on one executor. This means that large datasets grouped by low-cardinality keys have the potential to crash the execution of the job. (think allTweets.keyBy(_.date.dayOfTheWeek).groupByKey -> bumm)
favor the use of aggregateByKey or reduceByKey to apply map-side reduction before collecting values for a key (see the sketch after this list).
collect materializes the RDD (forces computation) and sends all the data to the driver. (think allTweets.collect -> bumm)
If you want to trigger the computation of an rdd, favor the use of rdd.count
To check the data of your rdd, use bounded operations like rdd.first (first element) or rdd.take(n) for n elements
If you really need to do collect, use rdd.filter or rdd.reduce to reduce its cardinality
collectAsMap is just collect behind the scenes
cartesian: creates the product of one RDD with another, potentially creating a very large RDD. oneKRdd.cartesian(oneKRdd).count = 1000000
consider adding keys and using join in order to combine 2 RDDs.
others?
In general, having an idea of the volume of data flowing through the stages of a Spark job and what each operation will do with it will help you keep mentally sane.
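A few of the replacements suggested in the list above, sketched in PySpark (the data and names are made up for illustration):
pairs = sc.parallelize([("a", 1), ("a", 3), ("b", 2)])

# Instead of pairs.groupByKey().mapValues(sum): map-side reduction, no per-key buffers.
sums = pairs.reduceByKey(lambda a, b: a + b)

# Instead of collect(): bounded peeks at the data.
sums.take(5)

# Instead of a full cartesian: key both sides and join, so only matching keys are combined.
left = sc.parallelize(["x1", "x2"]).keyBy(lambda s: s[0])
right = sc.parallelize(["x9", "y9"]).keyBy(lambda s: s[0])
left.join(right)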

Resources