Performance benefits of DataSet over RDD - apache-spark

After reading a few great articles (this, this and this) about Spark's DataSets, I ended up with the following list of DataSet performance benefits over RDD:
Logical and physical plan optimization;
Strict typing;
Vectorized operations;
Low-level memory management.
Questions:
Spark's RDD also builds a physical plan and can combine/optimize multiple transformations in the same stage. Then what is the benefit of DataSet over RDD?
From the first link you can see an example of RDD[Person]. Does DataSet have more advanced typing?
What do they mean by "vectorized operations"?
As I understand it, DataSet's low-level memory management = advanced serialization. That means off-heap storage of serializable objects, where you can read a single field of an object without deserializing the whole thing. But what about the situation when you have the IN_MEMORY_ONLY persistence strategy? Will DataSet serialize everything in that case anyway? Will it have any performance benefit over RDD?

Spark's RDD also builds a physical plan and can combine/optimize multiple transformations in the same stage. Then what is the benefit of DataSet over RDD?
When working with RDD, what you write is what you get. While certain transformations are optimized by chaining, the execution plan is a direct translation of the DAG. For example:
rdd.mapPartitions(f).mapPartitions(g).mapPartitions(h).shuffle()
where shuffle is an arbitrary shuffling transformation (*byKey, repartition, etc.), all three mapPartitions (map, flatMap, filter) will be chained without creating intermediate objects, but they cannot be rearranged.
Compared to that, Datasets use a significantly more restrictive programming model but can optimize execution using a number of techniques, including:
Selection (filter) pushdown. For example:
df.withColumn("foo", col("bar") + 1).where(col("bar").isNotNull())
can be executed as:
df.where(col("bar").isNotNull()).withColumn("foo", col("bar") + 1)
Early projections (select) and eliminations. For example:
df.withColumn("foo", col("bar") + 1).select("foo", "bar")
can be rewritten as:
df.select("foo", "bar").withColumn("foo", col("bar") + 1)
to avoid fetching and passing obsolete data. In the extreme case it can eliminate a particular transformation completely:
df.withColumn("foo", col("bar") + 1).select("bar")
can be optimized to
df.select("bar")
These optimizations are possible for two reasons:
Restrictive data model which enables dependency analysis without complex and unreliable static code analysis.
Clear operator semantics. Operators are side-effect free and we clearly distinguish between deterministic and nondeterministic ones.
To make it clear, let's say we have the following data model:
case class Person(name: String, surname: String, age: Int)
val people: RDD[Person] = ???
And we want to retrieve the surnames of all people older than 21. With RDD it can be expressed as:
people
  .map(p => (p.surname, p.age)) // f
  .filter { case (_, age) => age > 21 } // g
Now let's ask ourselves a few questions:
What is the relationship between the input age in f and the age variable in g?
Is f and then g the same as g and then f?
Are f and g side-effect free?
While the answer is obvious to a human reader, it is not for a hypothetical optimizer. Compared to that, with the DataFrame version:
people.toDF
  .select(col("surname"), col("age")) // f'
  .where(col("age") > 21) // g'
the answers are clear for both optimizer and human reader.
This has some further consequences when using statically typed Datasets (Spark 2.0 Dataset vs DataFrame).
Does DataSet have more advanced typing?
No - if you care about optimizations. The most advanced optimizations are limited to Dataset[Row], and at this moment it is not possible to encode a complex type hierarchy.
Maybe - if you accept the overhead of the Kryo or Java encoders.
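For illustration, a sketch of what that fallback looks like (the Legacy class is made up; with Encoders.kryo the Dataset stores an opaque binary blob, so Catalyst cannot optimize on individual fields):
import org.apache.spark.sql.{Encoder, Encoders}
class Legacy(val payload: Map[String, Int]) // not a case class / Product
implicit val legacyEncoder: Encoder[Legacy] = Encoders.kryo[Legacy]
val ds = spark.createDataset(Seq(new Legacy(Map("a" -> 1))))
ds.schema // a single binary column -- no per-field access for the optimizer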
What do they mean by "vectorized operations"?
In the context of optimization we usually mean loop vectorization / loop unrolling. Spark SQL uses code generation to create a compiler-friendly version of the high-level transformations, which can be further optimized to take advantage of vectorized instruction sets.
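A rough way to look at the generated code (assuming Spark 2.x+; debugCodegen comes from the debug implicits and its output varies by version):
import org.apache.spark.sql.execution.debug._
val agg = spark.range(0L, 1000000L).selectExpr("sum(id)")
// Operators prefixed with "*" in the plan are compiled by whole-stage code generation.
agg.explain()
// Dumps the Java source Catalyst generated for this plan.
agg.debugCodegen()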
As I understand it, DataSet's low-level memory management = advanced serialization.
Not exactly. The biggest advantage of using native allocation is escaping the garbage collector loop. Since garbage collection is quite often a limiting factor in Spark, this is a huge improvement, especially in contexts which require large data structures (like preparing shuffles).
Another important aspect is columnar storage which enables effective compression (potentially lower memory footprint) and optimized operations on compressed data.
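As a side note, Tungsten's off-heap allocation is controlled by configuration; a minimal sketch (the size is an arbitrary example):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("offheap-example")
  // Allow execution/storage memory to be allocated off-heap, outside the JVM GC.
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "1g")
  .getOrCreate()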
In general you can apply exactly the same types of optimizations using hand-crafted code on plain RDDs. After all, Datasets are backed by RDDs. The difference is only how much effort it takes.
Hand-crafted execution plan optimizations are relatively simple to achieve.
Making code compiler-friendly requires deeper knowledge and is error-prone and verbose.
Using sun.misc.Unsafe with native memory allocation is not for the faint-hearted.
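To give a flavour of that last point, here is a toy sketch of raw native allocation with sun.misc.Unsafe (illustration only; alignment, error handling and cleanup on failure are ignored):
val unsafeField = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
unsafeField.setAccessible(true)
val unsafe = unsafeField.get(null).asInstanceOf[sun.misc.Unsafe]
val address = unsafe.allocateMemory(8) // raw native memory, invisible to the GC
unsafe.putLong(address, 42L)
val value = unsafe.getLong(address)
unsafe.freeMemory(address) // forget this and you leak native memory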
Despite all its merits, the Dataset API is not universal. While certain types of common tasks can benefit from its optimizations, in many contexts you may see no improvement whatsoever, or even performance degradation compared to an RDD equivalent.

Related

reduce, reduceByKey, reduceGroups in Spark or Flink

reduce: takes the accumulated value and the next value and combines them into some aggregation.
reduceByKey: the same operation, but performed per specified key.
reduceGroups: applies the specified operation to grouped data.
I don't know how memory is managed for these operations. For example, how is data read while using the reduce function (e.g. is all data loaded into memory?)? I want to know how data is managed for reduce operations, and also what the difference is between these operations in terms of data management.
Reduce is one of the cheapest operations in Spark, since the only thing it does is actually group similar data on the same node. The only cost of a reduce operation is reading the tuple and deciding where it should be grouped.
This means that the plain reduce, in contrast to reduceByKey or reduceGroups, is more expensive, because Spark does not know how to make the grouping and searches for correlations among tuples.
Reduce can also ignore a tuple if it does not meet any requirement.
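For reference, a minimal sketch of the three operations side by side (data and identifiers are made up for illustration, assuming a spark-shell style spark/sc):
import spark.implicits._
// reduce: folds all elements of the RDD into a single value returned to the driver.
val total = sc.parallelize(Seq(1, 2, 3, 4)).reduce(_ + _) // 10
// reduceByKey: merges values per key, combining map-side before the shuffle.
val perKey = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).reduceByKey(_ + _) // ("a",4), ("b",2)
// reduceGroups: the Dataset counterpart, applied after groupByKey.
val perGroup = Seq(("a", 1), ("b", 2), ("a", 3)).toDS()
  .groupByKey(_._1)
  .reduceGroups((x, y) => (x._1, x._2 + y._2)) // Dataset[(String, (String, Int))]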

Spark: aggregate versus map and reduce

I'm learning Spark and starting to understand how Spark distributes the data and combines the results.
I came to the conclusion that using the operation map followed by reduce has an advantage over using just the operation aggregate. This is (at least I believe so) because aggregate uses a sequential operation, which hurts parallelism, while map and reduce can benefit from full parallelism.
So when I have a choice, isn't it better to use map and reduce rather than aggregate? Are there cases where aggregate is preferred? Or maybe cases where aggregate can't be replaced by the combination of map and reduce?
As an example - I want to find the string with the max length:
val z = sc.parallelize(List("123","12","345","4567"))
// instead of this aggregate ....
z.aggregate(0)((x, y) => math.max(x, y.length), (x, y) => math.max(x, y))
// .... shouldn't I rather use this map - reduce combination ?
z.map(_.length).reduce((x, y) => math.max(x, y))
A little example can be better than a long explanation.
Imagine you have a class Toto with an age field. You have many Totos and you want to compute the sum of the ages of all of them.
final case class Toto(val age: Int)
val n = 1000000 // arbitrary sample size
val rdd = sc.parallelize(0 until n).map(Toto(_))

// map/reduce style
val sum1 = rdd
  // O(n) operations to go through every Toto's age
  .map(_.age)
  // another O(n) to access the data, then O(n) operations to sum the n values
  .reduce(_ + _)
// You get the result with 2 passes over your data plus O(n) additions

// aggregate style
val sum2 = rdd.aggregate(0)((agg, e) => agg + e.age, _ + _)
// With one pass over the data and O(n) additions you obtain the same result
It's a bit more complicated if you take into account accesses as well as operations.
Aggregate still accesses and then sums each age into the accumulator, which represents O(2n) operations: O(n) accesses plus O(n) additions, plus a negligible merge operation between the partial aggregates.
On the other side, with the map/reduce style, the map first represents O(n) accesses, then there are again O(n) accesses to the data to reduce them, with an overhead of O(n) addition operations, for a total of O(3n) operations.
And don't forget that Spark is lazy, so all of your transformations will only be executed when triggered by a final action.
I presume that using aggregate will save some operations and therefore improve application running time. But depending on what you're doing, it could be more useful, for readability, to express a map followed by a reduce rather than an aggregate or combineByKey (a generalization of aggregateByKey). So I would say it depends on which goals you want to reach for a given use case.
I believe I can partially answer my own question. I was wrongly assuming that, because a sequential operation is used, aggregate might be hurt in its parallelism. The data can still be parallelized and the sequential op will be executed on each chunk. This doesn't seem any less performant than the map operation. So the question that remains is: why would you use aggregate as opposed to the map-reduce combination?
The aggregate operation allows you to specify a combiner function (to reduce the amount of data sent through the shuffle) that is different from the reducer; with the map-reduce combination, the same function is used to combine and to reduce. I know I used the old MapReduce terminology, but conceptually all shared-nothing, shuffle-based frameworks do this, and if you google "mapreduce combiner" you will find a lot of explanations of the concept.
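A small sketch of that point: computing an average with aggregate, where the per-element function and the merge ("combiner") function are genuinely different (names are made up; assumes a spark-shell style sc):
val nums = sc.parallelize(1 to 100)
val (sum, count) = nums.aggregate((0L, 0L))(
  // seqOp: fold one element into the partition-local accumulator
  (acc, x) => (acc._1 + x, acc._2 + 1),
  // combOp: merge two partition-local accumulators, like a MapReduce combiner
  (a, b) => (a._1 + b._1, a._2 + b._2)
)
val avg = sum.toDouble / count // 50.5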

Why does collect_list in Spark not use partial aggregation

I recently played around with UDAFs and looked into the source code of the built-in aggregation function collect_list. I was surprised to see that collect_list does not have a merge method implemented, although I think this is really straightforward (just concatenate two Arrays). Code taken from org.apache.spark.sql.catalyst.expressions.aggregate.collect.Collect:
override def merge(buffer: InternalRow, input: InternalRow): Unit = {
  sys.error("Collect cannot be used in partial aggregations.")
}
It is no longer the case, as of SPARK-1893, but I'd assume that the initial design had mostly collect_list in mind.
Because collect_list is logically equivalent to groupByKey, the motivation would be exactly the same: to avoid long GC pauses. In particular, map-side combine in groupByKey was disabled with SPARK-772:
Map side combine in group by key case does not reduce the amount of data shuffled. Instead, it forces a lot more objects to go into old gen, and leads to worse GC.
So to address your comment:
I think this is really straightforward (just concatenate two Arrays).
It might be simple but it doesn't add much value (unless there is another reducing operation on top of it) and sequence concatenation is expensive.
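To make the groupByKey analogy concrete, a small sketch (column names are made up; assumes a spark-shell style spark/sc):
import org.apache.spark.sql.functions.collect_list
import spark.implicits._
val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("k", "v")
// collect_list keeps every value per key...
df.groupBy("k").agg(collect_list("v")).show()
// ...which is logically the same data volume as an RDD groupByKey, so merging
// partial lists on the map side would not shrink what goes through the shuffle.
df.rdd.map(r => (r.getString(0), r.getInt(1))).groupByKey().collect()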

Is it OK to have nodes with mutable attributes when using Spark's GraphX distributed functions?

I am looking at the implementation of a certain graph clustering algorithm using Spark's GraphX graph analytics library. I noticed that the implementation uses a class VertexState with several mutable (var) members.
I wonder whether doing this sort of thing could lead to incorrect behaviour, due to the fact that in distributed implementations the same node could be replicated in more than one processing node.
My question is not so much about the correctness of this practice in the context of this particular implementation, but in general.
Perhaps it is fine if one is just using certain functions such as map on the vertex set, but it might be problematic if one is using others that involve more than one vertex at a time, such as mapReduceTriplets?
Having mutable members is just fine... as long as you don't mutate them. Any type of data mutation in place can result in incorrect or non-deterministic behavior. There are cases when you can use mutable accumulators with aggregations but you should never modify data stored in a distributed object.
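A contrived sketch of the kind of in-place mutation warned against above (class and field names are made up; assumes a spark-shell style sc):
class VertexState(var label: Long) extends Serializable
val states = sc.parallelize(Seq(new VertexState(1L), new VertexState(2L)))
// Risky: mutating shared objects in place; if partitions are recomputed or
// vertices are replicated, different copies can end up with different values.
val mutated = states.map { s => s.label += 1; s }
// Safer: treat state as immutable and return a new instance instead.
val safe = states.map(s => new VertexState(s.label + 1))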

Mind blown: RDD.zip() method

I just discovered the RDD.zip() method and I cannot imagine what its contract could possibly be.
I understand what it does, of course. However, it has always been my understanding that
the order of elements in an RDD is a meaningless concept
the number of partitions and their sizes is an implementation detail only available to the user for performance tuning
In other words, an RDD is a (multi)set, not a sequence (and, of course, in, e.g., Python one gets AttributeError: 'set' object has no attribute 'zip')
What is wrong with my understanding above?
What was the rationale behind this method?
Is it legal outside the trivial context like a.map(f).zip(a)?
EDIT 1:
Another crazy method is zipWithIndex(), as well as the various zipPartitions() variants.
Note that first() and take() are not crazy because they are just (non-random) samples of the RDD.
collect() is also okay - it just converts a set to a sequence which is perfectly legit.
EDIT 2: The reply says:
when you compute one RDD from another the order of elements in the new RDD may not correspond to that in the old one.
This appears to imply that even the trivial a.map(f).zip(a) is not guaranteed to be equivalent to a.map(x => (f(x),x)). What is the situation when zip() results are reproducible?
It is not true that RDDs are always unordered. An RDD has a guaranteed order if it is the result of a sortBy operation, for example. An RDD is not a set; it can contain duplicates. Partitioning is not opaque to the caller, and can be controlled and queried. Many operations do preserve both partitioning and order, like map. That said I find it a little easy to accidentally violate the assumptions that zip depends on, since they're a little subtle, but it certainly has a purpose.
The mental model I use (and recommend) is that the elements of an RDD are ordered, but when you compute one RDD from another the order of elements in the new RDD may not correspond to that in the old one.
For those who want to be aware of partitions, I'd say that:
The partitions of an RDD have an order.
The elements within a partition have an order.
If you think of "concatenating" the partitions (say laying them "end to end" in order) using the order of elements within them, the overall ordering you end up with corresponds to the order of elements if you ignore partitions.
But again, if you compute one RDD from another, all bets about the order relationships of the two RDDs are off.
Several members of the RDD class (I'm referring to the Scala API) strongly suggest an order concept (as does their documentation):
collect()
first()
partitions
take()
zipWithIndex()
as does Partition.index as well as SparkContext.parallelize() and SparkContext.makeRDD() (which both take a Seq[T]).
In my experience these ways of "observing" order give results that are consistent with each other, and the ones that translate back and forth between RDDs and ordered Scala collections behave as you would expect -- they preserve the overall order of elements. This is why I say that, in practice, RDDs have a meaningful order concept.
Furthermore, while there are obviously many situations where computing an RDD from another must change the order, in my experience order tends to be preserved where it is possible/reasonable to do so. Operations that don't re-partition and don't fundamentally change the set of elements especially tend to preserve order.
But this brings me to your question about "contract", and indeed the documentation has a problem in this regard. I have not seen a single place where an operation's effect on element order is made clear. (The OrderedRDDFunctions class doesn't count, because it refers to an ordering based on the data, which may differ from the raw order of elements within the RDD. Likewise the RangePartitioner class.) I can see how this might lead you to conclude that there is no concept of element order, but the examples I've given above make that model unsatisfying to me.
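A small sketch of these order-observing operations (assuming a spark-shell style sc; exact partition contents depend on how parallelize splits the sequence):
val a = sc.parallelize(Seq(10, 20, 30, 40), numSlices = 2)
a.collect() // Array(10, 20, 30, 40): partition order, then element order within partitions
a.first() // 10
a.take(3) // Array(10, 20, 30)
a.zipWithIndex().collect() // Array((10,0), (20,1), (30,2), (40,3))
a.glom().collect() // partitions laid end to end, e.g. Array(Array(10, 20), Array(30, 40))
// zip requires the same number of partitions and the same number of elements per
// partition, which holds for element-wise derivations of the same RDD:
a.map(_ * 2).zip(a).collect() // Array((20,10), (40,20), (60,30), (80,40))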

Resources