In Learning Spark you can find the section "Pseudo set operations". There, it is, rightfully, stated that RDDs are not sets from a mathematical point of view. This is obviously correct; e.g. elements in an RDD are not unique. One is tempted to argue that RDDs are multisets. It raises two questions:
Are RDDs multisets or not?
In turn:
If RDDs are not multisets, why? What is the difference between RDDs and multisets?
If RDDs are multisets, why the (multi)set-operations are not defined accordingly? For example If M1 = [a,b,a,b] and M2 = [a,a,b,c], then from a mathematical point of view their intersection should be [a,a,b]. However, Spark returns [a,b]; kind of the purely set view of the operation. What is the motivation behind?
N.B.: Cartesian product: As if to add to the confusion, the cartesian product of M1 and M2 behaves like a multiset product; i.e. returns a multiset.
Related
What is the difference between reduce and reduceByKey in Apache Spark in terms of their functionalities?
Why reduceByKey is a transformation and reduce is an action?
This is close to a duplicate of my answer explaining reduceByKey, but I will elaborate to the specific part that makes the two different. However refer to my answer for a bit more specifics on the internals of reduceByKey.
Basically, reduce must pull the entire dataset down into a single location because it is reducing to one final value. reduceByKey on the other hand is one value for each key. And since this action can be run on each machine locally first then it can remain an RDD and have further transformations done on its dataset.
Note, however that there is a reduceByKeyLocally you can use to automatically pull down the Map to a single location also.
Please go through this official documentation link .
reduce is an action which Aggregate the elements of the dataset using a function func (which takes two arguments and returns one),also we can use reduce for single RDDs (for more info Please click HERE).
reduceByKey When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. (for more info Please click HERE)
this is the qt assistant :
reduce(f): Reduces the elements of this RDD using the specified
commutative and associative binary operator. Currently reduces
partitions locally.
reduceByKey(func, numPartitions=None, partitionFunc=) :
Merge the values for each key using an associative and commutative reduce
function.
These three Apache Spark Transformations are little confusing. Is there any way I can determine when to use which one and when to avoid one?
I think official guide explains it well enough.
I will highlight differences (you have RDD of type (K, V)):
if you need to keep the values, then use groupByKey
if you no need to keep the values, but you need to get some aggregated info about each group (items of the original RDD, which have the same K), you have two choices: reduceByKey or aggregateByKey (reduceByKey is kind of particular aggregateByKey)
2.1 if you can provide an operation which take as an input (V, V) and returns V, so that all the values of the group can be reduced to the one single value of the same type, then use reduceByKey. As a result you will have RDD of the same (K, V) type.
2.2 if you can not provide this aggregation operation, then use aggregateByKey. It happens when you reduce values to another type. So you will have (K, V2) as a result.
In addition to #Hlib answer, I would like to add few more points.
groupByKey() is just to group your dataset based on a key.
reduceByKey() is something like grouping + aggregation. We can say reduceBykey() equvelent to dataset.group(...).reduce(...).
aggregateByKey() is logically same as reduceByKey() but it lets you return result in different type. In another words, it lets you have a input as type x and aggregate result as type y. For example (1,2),(1,4) as input and (1,"six") as output.
I saw the following post a little bit back: Understanding TreeReduce in Spark
I am still trying to exactly understand when to use a treeReduce vs a reduceByKey. I think we can use a universal example like a word count to help me further understand what is going on.
Does it always make sense to use reduceByKey in a word count?
Or is there a particular size of data when treeReduce makes more sense?
Are there particular cases or rules of thumbs when treeReduce is the better option?
Also this may be answered in the above based on reduceByKey but does anything change with reduceByKeyLocally and treeReduce
How do I appropriately determine depth?
Edit: So playing in spark-shell, I think I fundamentally don't understand the concept of treeReduce but hopefully an example and those question help.
res2: Array[(String, Int)] = Array((D,1), (18964,1), (D,1), (1,1), ("",1), ("",1), ("",1), ("",1), ("",1), (1,1))
scala> val reduce = input.reduceByKey(_+_)
reduce: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[11] at reduceByKey at <console>:25
scala> val tree = input.treeReduce(_+_, 2)
<console>:25: error: type mismatch;
found : (String, Int)
required: String
val tree = input.treeReduce(_+_, 2)
There is a fundamental difference between the two-reduceByKey is only available on key-value pair RDDs, while treeReduce is a generalization of reduce operation on any RDD. reduceByKey is used for implementing treeReduce but they are not related in any other sense.
reduceByKey performs reduction per each key, resulting in an RDD; it is not an "action" in RDD sense but a transformation that returns a ShuffleRDD. This is equivalent to groupByKey followed by a map that does key-wise reduction (check this why using groupByKey is inefficient).
On the other hand, treeAggregate is a generalization of reduce function, inspired from AllReduce. This is an "action" in spark sense, returning the result on the master node. As explained the link posted in your question, after performing local reduce operation, reduce performs rest of the computation on the master, which can be very burdensome (especially in machine learning when the reduce function results in a large vectors or a matrices). Instead, treeReduce perform the reduction in parallel using reduceByKey (this is done by creating a key-value pair RDD on the fly, with the keys determined by the depth of the tree; check implementation here).
So, to answer your first two questions, you have to use reduceByKey for word count since you are interested in getting per word-count and treeReduce is not appropriate here. The other two questions are not related to this topic.
I just discovered the RDD.zip() method and I cannot imagine what its contract could possibly be.
I understand what it does, of course. However, it has always been my understanding that
the order of elements in an RDD is a meaningless concept
the number of partitions and their sizes is an implementation detail only available to the user for performance tuning
In other words, an RDD is a (multi)set, not a sequence (and, of course, in, e.g., Python one gets AttributeError: 'set' object has no attribute 'zip')
What is wrong with my understanding above?
What was the rationale behind this method?
Is it legal outside the trivial context like a.map(f).zip(a)?
EDIT 1:
Another crazy method is zipWithIndex(), as well as well as the various zipPartitions() variants.
Note that first() and take() are not crazy because they are just (non-random) samples of the RDD.
collect() is also okay - it just converts a set to a sequence which is perfectly legit.
EDIT 2: The reply says:
when you compute one RDD from another the order of elements in the new RDD may not correspond to that in the old one.
This appears to imply that even the trivial a.map(f).zip(a) is not guaranteed to be equivalent to a.map(x => (f(x),x)). What is the situation when zip() results are reproducible?
It is not true that RDDs are always unordered. An RDD has a guaranteed order if it is the result of a sortBy operation, for example. An RDD is not a set; it can contain duplicates. Partitioning is not opaque to the caller, and can be controlled and queried. Many operations do preserve both partitioning and order, like map. That said I find it a little easy to accidentally violate the assumptions that zip depends on, since they're a little subtle, but it certainly has a purpose.
The mental model I use (and recommend) is that the elements of an RDD are ordered, but when you compute one RDD from another the order of elements in the new RDD may not correspond to that in the old one.
For those who want to be aware of partitions, I'd say that:
The partitions of an RDD have an order.
The elements within a partition have an order.
If you think of "concatenating" the partitions (say laying them "end to end" in order) using the order of elements within them, the overall ordering you end up with corresponds to the order of elements if you ignore partitions.
But again, if you compute one RDD from another, all bets about the order relationships of the two RDDs are off.
Several members of the RDD class (I'm referring to the Scala API) strongly suggest an order concept (as does their documentation):
collect()
first()
partitions
take()
zipWithIndex()
as does Partition.index as well as SparkContext.parallelize() and SparkContext.makeRDD() (which both take a Seq[T]).
In my experience these ways of "observing" order give results that are consistent with each other, and the ones that translate back and forth between RDDs and ordered Scala collections behave as you would expect -- they preserve the overall order of elements. This is why I say that, in practice, RDDs have a meaningful order concept.
Furthermore, while there are obviously many situations where computing an RDD from another must change the order, in my experience order tends to be preserved where it is possible/reasonable to do so. Operations that don't re-partition and don't fundamentally change the set of elements especially tend to preserve order.
But this brings me to your question about "contract", and indeed the documentation has a problem in this regard. I have not seen a single place where an operation's effect on element order is made clear. (The OrderedRDDFunctions class doesn't count, because it refers to an ordering based on the data, which may differ from the raw order of elements within the RDD. Likewise the RangePartitioner class.) I can see how this might lead you to conclude that there is no concept of element order, but the examples I've given above make that model unsatisfying to me.
Brand new to Apache Spark and I'm a little confused how to make updates to a value that sits outside of a .mapTriplets iteration in GraphX. See below:
def mapTripletsMethod(edgeWeights: Graph[Int, Double], stationaryDistribution: Graph[Double, Double]) = {
val tempMatrix: SparseDoubleMatrix2D = graphToSparseMatrix(edgeWeights)
stationaryDistribution.mapTriplets{ e =>
val row = e.srcId.toInt
val column = e.dstId.toInt
var cellValue = -1 * tempMatrix.get(row, column) + e.dstAttr
tempMatrix.set(row, column, cellValue) // this doesn't do anything to tempMatrix
e
}
}
I'm guessing this is due to the design of an RDD and there's no simple way to update the tempMatrix value. When I run the above code the tempMatrix.set method does nothing. It was rather difficult to try to follow the problem in the debugger.
Does anyone have an easy solution? Thank you!
Edit
I've made an update above to show that stationaryDistribution is a graph RDD.
You could make tempMatrix be of type RDD[((Int,Int), Double)] -- that is, each entry is a pair where the first element is in turn a (row,col) pair. Then use the PairRDDFunctions class to combine that with ((row,col),weight) triplets generated by your mapTriplets call. (So, don't think of it as updating the tempMatrix, but rather combining two RDDs to get a third.)
If you need to support stationary distribution graphs where there is more than one edge per vertex pair it gets a little tricky: you'll probably need to combine those edges in a reduction pass to create an RDD with one entry per pair, with a list of weights, and then apply all the weights to a given (row,col) pair at the same time. Otherwise it's very simple.
Notice that `PairRDDFunctions' on the one hand give you ways to combine multiple RDDs into one, or on the other hand to pull the values out into a Map on the master. Assuming that the distribution matrix is large enough to merit an RDD in the first place, I think you should do the whole thing on RDDs.
Another approach is to make the tempMatrix be a GraphRDD too, which may or may not make sense depending on what you're going to do with it next.