I saw the following post a little bit back: Understanding TreeReduce in Spark
I am still trying to understand exactly when to use treeReduce vs. reduceByKey. I think a universal example like word count can help me understand what is going on.
Does it always make sense to use reduceByKey in a word count?
Or is there a particular size of data when treeReduce makes more sense?
Are there particular cases or rules of thumb for when treeReduce is the better option?
Also, this may already be answered above for reduceByKey, but does anything change with reduceByKeyLocally and treeReduce?
How do I appropriately determine depth?
Edit: After playing in spark-shell, I think I fundamentally don't understand the concept of treeReduce, but hopefully an example and these questions help.
res2: Array[(String, Int)] = Array((D,1), (18964,1), (D,1), (1,1), ("",1), ("",1), ("",1), ("",1), ("",1), (1,1))
scala> val reduce = input.reduceByKey(_+_)
reduce: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[11] at reduceByKey at <console>:25
scala> val tree = input.treeReduce(_+_, 2)
<console>:25: error: type mismatch;
found : (String, Int)
required: String
val tree = input.treeReduce(_+_, 2)
There is a fundamental difference between the two: reduceByKey is only available on key-value pair RDDs, while treeReduce is a generalization of the reduce operation on any RDD. reduceByKey is used to implement treeReduce, but the two are not related in any other sense.
reduceByKey performs a reduction per key, resulting in an RDD; it is not an "action" in the RDD sense but a transformation that returns a ShuffledRDD. Its result is equivalent to groupByKey followed by a map that does a key-wise reduction (check this for why using groupByKey is inefficient).
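To make the equivalence concrete, here is a minimal sketch in plain Scala, with a local Seq standing in for a pair RDD. No Spark is involved, and the object and method names are made up for illustration. Both functions give the same result, but the second never builds the full list of values for a key, which is roughly why reduceByKey is preferred.

```scala
// Plain Scala stand-in for a pair RDD; no Spark involved, names are hypothetical.
object ReduceByKeyModel {
  // groupByKey-style: collect every value per key, then reduce each group.
  def groupThenReduce[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // reduceByKey-style: fold each value into a running result per key,
  // never materializing the full list of values for a key.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(f(_, v)).getOrElse(v))
    }
}
```

For a word count, both would be called with pairs such as Seq("a" -> 1, "b" -> 1, "a" -> 1) and the function _ + _.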
On the other hand, treeReduce is a generalization of the reduce function, inspired by AllReduce. This is an "action" in the Spark sense, returning the result to the driver. As explained in the link posted in your question, after performing the local reduce operation, reduce performs the rest of the computation on the driver, which can be very burdensome (especially in machine learning, when the reduce function results in large vectors or matrices). Instead, treeReduce performs the reduction in parallel using reduceByKey (this is done by creating a key-value pair RDD on the fly, with the keys determined by the depth of the tree; check the implementation here).
So, to answer your first two questions: you have to use reduceByKey for word count, since you are interested in per-word counts, and treeReduce is not appropriate here. The other two questions are not related to this topic.
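The difference between a flat reduce and a tree-style reduce can be sketched with plain Scala collections. This is only a model of the idea, not Spark's implementation: the nested Seq stands in for partitions, all names are hypothetical, and the fanIn parameter is a simplification (Spark's treeReduce takes a depth instead).

```scala
// Plain Scala model of the idea; no Spark involved, names are hypothetical.
object TreeReduceSketch {
  // Reduce each "partition" locally, as both reduce and treeReduce do first.
  def localPartials[T](partitions: Seq[Seq[T]])(f: (T, T) => T): Vector[T] =
    partitions.filter(_.nonEmpty).map(_.reduce(f)).toVector

  // Flat reduce: a single place (the driver) combines every partial result.
  def flatReduce[T](partitions: Seq[Seq[T]])(f: (T, T) => T): T =
    localPartials(partitions)(f).reduce(f)

  // Tree reduce: combine partial results in groups of `fanIn` per level,
  // so no single step has to merge all of them at once.
  def treeReduce[T](partitions: Seq[Seq[T]], fanIn: Int)(f: (T, T) => T): T = {
    require(fanIn >= 2, "fanIn must be at least 2")
    var level = localPartials(partitions)(f)
    require(level.nonEmpty, "cannot reduce an empty collection")
    while (level.size > 1) {
      level = level.grouped(fanIn).map(_.reduce(f)).toVector
    }
    level.head
  }
}
```

Note that f combines two whole elements of type T. For an RDD[(String, Int)] the function would have to take and return (String, Int) tuples, which is why _+_ does not type-check in the spark-shell session in the question.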
Related
I am reading Spark's source code. In its shuffle implementation, during shuffle reading, when BlockStoreShuffleReader.read is called, it first uses an ExternalAppendOnlyMap to aggregate:
def combineValuesByKey(
    iter: Iterator[_ <: Product2[K, V]],
    context: TaskContext): Iterator[(K, C)] = {
  val combiners = new ExternalAppendOnlyMap[K, V, C](createCombiner, mergeValue, mergeCombiners)
  combiners.insertAll(iter)
  updateMetrics(context, combiners)
  combiners.iterator
}
Then it uses an ExternalSorter to sort and aggregate, so there will be a lot of disk spill/read work here.
val resultIter = dep.keyOrdering match {
  case Some(keyOrd: Ordering[K]) =>
    // Create an ExternalSorter to sort the data.
    val sorter =
      new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer)
    ...
My question is: why do we need both ExternalSorter and ExternalAppendOnlyMap? Is it possible to combine these two into one?
I mean, their code looks quite similar, so why can't we use ExternalSorter everywhere rather than ExternalAppendOnlyMap, since it can both aggregate and sort?
DISCLAIMER I'm only now exploring this part of Spark Core so my understanding may be entirely incorrect.
My understanding is that ExternalAppendOnlyMap is simply a spillable size-tracking append-only map while ExternalSorter can be a buffer or a map (based on map-side combine flag for map-side partial values).
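As a rough picture of the aggregation that combineValuesByKey in the question performs, here is an in-memory sketch in plain Scala. It is an assumption-laden model, not Spark's code: it reuses the createCombiner and mergeValue names from the question, but uses an ordinary mutable map and leaves out size tracking, spilling to disk, and mergeCombiners (which matters only when spilled or partial results are merged).

```scala
import scala.collection.mutable

// In-memory model only: the real ExternalAppendOnlyMap also tracks its size
// and spills to disk. The object name is hypothetical.
object CombineSketch {
  def combineValuesByKey[K, V, C](
      iter: Iterator[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Iterator[(K, C)] = {
    val combiners = mutable.LinkedHashMap.empty[K, C]
    iter.foreach { case (k, v) =>
      val updated = combiners.get(k) match {
        case Some(c) => mergeValue(c, v) // key seen before: fold the value in
        case None    => createCombiner(v) // first value for this key
      }
      combiners.update(k, updated)
    }
    combiners.iterator
  }
}
```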
Is it possible we combine these two into one?
With that I think that they share quite a lot, and ExternalSorter seems more flexible (as it can do what ExternalAppendOnlyMap does).
I think the answer to your question is "Yes", but very few people are brave or encouraged enough to implement the changes.
I recently played around with UDAFs and looked into the source code of the built-in aggregation function collect_list. I was surprised to see that collect_list does not have a merge method implemented, although I think this is really straightforward (just concatenate two Arrays). Code taken from org.apache.spark.sql.catalyst.expressions.aggregate.collect.Collect:
override def merge(buffer: InternalRow, input: InternalRow): Unit = {
  sys.error("Collect cannot be used in partial aggregations.")
}
It is no longer the case as of SPARK-1893, but I'd assume that the initial design had mostly collect_list in mind.
Because collect_list is logically equivalent to groupByKey, the motivation would be exactly the same: to avoid long GC pauses. In particular, map-side combine in groupByKey was disabled in SPARK-772:
Map side combine in group by key case does not reduce the amount of data shuffled. Instead, it forces a lot more objects to go into old gen, and leads to worse GC.
So, to address your comment:
I think this is really straightforward (just concatenate two Arrays).
It might be simple but it doesn't add much value (unless there is another reducing operation on top of it) and sequence concatenation is expensive.
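The point that map-side combine does not shrink the shuffle for a collect-style aggregation can be illustrated with a small plain Scala model. The data and names below are made up, and counting values per partition is only a stand-in for the amount of data shuffled.

```scala
// Plain Scala model; partitions and names are hypothetical.
object MapSideCombineSketch {
  // Two "map-side partitions" of (key, value) records, six records in total.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1, "a" -> 2, "b" -> 3),
    Seq("a" -> 4, "b" -> 5, "b" -> 6)
  )

  // Partial sum per partition: a single Int per key leaves each partition.
  val partialSums: Seq[Map[String, Int]] =
    partitions.map(_.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum })

  // Partial collect per partition: every original value still leaves the partition.
  val partialLists: Seq[Map[String, Seq[Int]]] =
    partitions.map(_.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) })

  // Number of values that would cross the shuffle in each case.
  val valuesShuffledForSum: Int = partialSums.map(_.size).sum
  val valuesShuffledForCollect: Int = partialLists.map(_.values.map(_.size).sum).sum
}
```

In this model the sum sends four values across the shuffle while the collect still sends all six, only wrapped in extra per-key buffers.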
What is the difference between reduce and reduceByKey in Apache Spark in terms of their functionalities?
Why is reduceByKey a transformation and reduce an action?
This is close to a duplicate of my answer explaining reduceByKey, but I will elaborate on the specific part that makes the two different. Refer to my answer for more specifics on the internals of reduceByKey.
Basically, reduce must pull the entire dataset down into a single location because it is reducing to one final value. reduceByKey, on the other hand, produces one value for each key. And since this operation can be run on each machine locally first, the result can remain an RDD and have further transformations applied to it.
Note, however that there is a reduceByKeyLocally you can use to automatically pull down the Map to a single location also.
Please go through the official documentation.
reduce is an action that aggregates the elements of the dataset using a function func (which takes two arguments and returns one); it can also be used on single (non-pair) RDDs.
reduceByKey, when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V.
This is from Qt Assistant:
reduce(f): Reduces the elements of this RDD using the specified
commutative and associative binary operator. Currently reduces
partitions locally.
reduceByKey(func, numPartitions=None, partitionFunc=) :
Merge the values for each key using an associative and commutative reduce
function.
These three Apache Spark transformations (groupByKey, reduceByKey, and aggregateByKey) are a little confusing. Is there any way I can determine when to use which one and when to avoid one?
I think the official guide explains it well enough.
I will highlight the differences (assuming you have an RDD of type (K, V)):
1. If you need to keep the values, use groupByKey.
2. If you do not need to keep the values, but need some aggregated information about each group (items of the original RDD that have the same K), you have two choices: reduceByKey or aggregateByKey (reduceByKey is a particular case of aggregateByKey).
2.1 If you can provide an operation that takes (V, V) as input and returns V, so that all the values of a group can be reduced to a single value of the same type, use reduceByKey. The result is an RDD of the same (K, V) type.
2.2 If you cannot provide such an operation, use aggregateByKey. This happens when you reduce the values to another type, so the result is of type (K, V2).
In addition to @Hlib's answer, I would like to add a few more points.
groupByKey() is just to group your dataset based on a key.
reduceByKey() is something like grouping + aggregation. We can say reduceByKey() is equivalent to dataset.group(...).reduce(...).
aggregateByKey() is logically the same as reduceByKey(), but it lets you return the result in a different type. In other words, it lets you have an input of type x and an aggregate result of type y. For example, (1,2),(1,4) as input and (1,"six") as output.
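The "different result type" point can be sketched in plain Scala. This is a local model of the aggregateByKey idea (a zero value, a within-partition function, and a cross-partition function), not Spark's implementation; the nested Seq stands in for partitions and all names are hypothetical.

```scala
// Plain Scala model of aggregateByKey; no Spark involved, names are hypothetical.
object AggregateByKeyModel {
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zero: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Within each partition, fold values into an accumulator starting from `zero`.
    val partials: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zero), v))
      }
    }
    // Across partitions, merge the accumulators that belong to the same key.
    partials.flatten.foldLeft(Map.empty[K, U]) { case (acc, (k, u)) =>
      acc.updated(k, acc.get(k).map(combOp(_, u)).getOrElse(u))
    }
  }
}
```

For example, with Int values and a (sum, count) accumulator, the input type is Int while the result type per key is (Int, Int), something a (V, V) => V function passed to reduceByKey could not express directly.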
For DataFrame, it is easy to generate a new column with some operation using a udf with df.withColumn("newCol", myUDF("someCol")). To do something like this in Dataset, I guess I would be using the map function:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
You have to pass the entire case class T as input to the function. If the Dataset[T] has a lot of fields/columns, it would seem very inefficient to be passing the entire row if you just wanted to make one extra column by operating on one of the many columns of T. My question is, is Catalyst smart enough to be able to optimize this?
Is Catalyst smart enough to be able to optimize this?
tl;dr No. See SPARK-14083 Analyze JVM bytecode and turn closures into Catalyst expressions.
There's currently no way for Spark SQL's Catalyst Optimizer to know what you do in your Scala code.
Quoting SPARK-14083:
One big advantage of the Dataset API is the type safety, at the cost of performance due to heavy reliance on user-defined closures/lambdas. These closures are typically slower than expressions because we have more flexibility to optimize expressions (known data types, no virtual function calls, etc). In many cases, it's actually not going to be very difficult to look into the byte code of these closures and figure out what they are trying to do. If we can understand them, then we can turn them directly into Catalyst expressions for more optimized executions.
And there's even your case mentioned:
df.map(_.name) // equivalent to expression col("name")
As you can see it's still open and I doubt anyone works on this currently.
What you could do to help the Spark optimizer is to select that one column and only then use the map operator with a one-argument function.
That would certainly match your requirements of not passing the entire JVM object to your function, but would not get rid of this slow deserialization from an internal row representation to your Scala object (that would land on the JVM and occupy some space until a GC happens).
I tried to figure it out myself since I could not find a response anywhere.
Let's have a dataset which contains case classes with multiple fields:
scala> case class A(x: Int, y: Int)
scala> val dfA = spark.createDataset[A](Seq(A(1, 2)))
scala> val dfX = dfA.map(_.x)
Now if we check the optimized plan we get the following:
scala> val plan = dfX.queryExecution.optimizedPlan
SerializeFromObject [input[0, int, true] AS value#8]
+- MapElements <function1>, obj#7: int
   +- DeserializeToObject newInstance(class A), obj#6: A
      +- LocalRelation [x#2, y#3]
According to the more verbose plan.toJSON the DeserializeToObject step assumes both x and y to be present.
As a proof, take for example the following snippet, which uses reflection instead of directly touching the fields of A and still works.
val dfX = dfA.map(a =>
  // Look up the accessor for x reflectively and invoke it on the current element.
  a.getClass.getMethods.find(_.getName == "x").get.invoke(a).asInstanceOf[Int]
)