For a DataFrame it is easy to generate a new column by applying a UDF with df.withColumn("newCol", myUDF("someCol")). To do something similar with a Dataset, I guess I would use the map function:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
You have to pass the entire case class T as input to the function. If the Dataset[T] has a lot of fields/columns, it would seem very inefficient to be passing the entire row if you just wanted to make one extra column by operating on one of the many columns of T. My question is, is Catalyst smart enough to be able to optimize this?
Is Catalyst smart enough to be able to optimize this?
tl;dr No. See SPARK-14083 Analyze JVM bytecode and turn closures into Catalyst expressions.
There's currently no way for Spark SQL's Catalyst Optimizer to know what you do in your Scala code.
Quoting SPARK-14083:
One big advantage of the Dataset API is the type safety, at the cost of performance due to heavy reliance on user-defined closures/lambdas. These closures are typically slower than expressions because we have more flexibility to optimize expressions (known data types, no virtual function calls, etc). In many cases, it's actually not going to be very difficult to look into the byte code of these closures and figure out what they are trying to do. If we can understand them, then we can turn them directly into Catalyst expressions for more optimized executions.
And your case is even mentioned there:
df.map(_.name) // equivalent to expression col("name")
As you can see, the issue is still open, and I doubt anyone is working on it currently.
What you could do to help the Spark optimizer is to select that one column first and only then use the map operator with a one-argument function.
That would certainly match your requirements of not passing the entire JVM object to your function, but would not get rid of this slow deserialization from an internal row representation to your Scala object (that would land on the JVM and occupy some space until a GC happens).
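For illustration, a minimal sketch of that "select first, then map" approach, assuming a SparkSession named spark and a hypothetical ds: Dataset[T] with an Int column "someCol" (the names are made up):
import spark.implicits._

// Narrow to the single column first, so only that value is deserialized per row
val narrowed = ds.select($"someCol".as[Int]) // Dataset[Int]
val withExtra = narrowed.map(_ + 1)          // one-argument function over the value only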
I tried to figure this out myself since I could not find an answer anywhere.
Let's have a dataset which contains case classes with multiple fields:
scala> case class A(x: Int, y: Int)
scala> val dfA = spark.createDataset[A](Seq(A(1, 2)))
scala> val dfX = dfA.map(_.x)
Now if we check the optimized plan we get the following:
scala> val plan = dfX.queryExecution.optimizedPlan
SerializeFromObject [input[0, int, true] AS value#8]
+- MapElements <function1>, obj#7: int
+- DeserializeToObject newInstance(class A), obj#6: A
+- LocalRelation [x#2, y#3]
According to the more verbose plan.toJSON the DeserializeToObject step assumes both x and y to be present.
As a proof, take for example the following snippet, which uses reflection instead of touching the fields of A directly, and which still works.
val dfX = dfA.map(a =>
  a.getClass.getMethods.find(_.getName == "x").get.invoke(a).asInstanceOf[Int]
)
Related
Does SparkSQL support subquery? states that no subquery support was available in Spark 2.0.
Has this changed recently?
Your comment is correct, but your question is a little vague. However, I take your point, and I find the underlying concepts fine and worth this sort of question, so there you go.
So, this is now possible with the DataFrame API, but not with the Dataset API or the DSL, as you state:
SELECT A.dep_id,
A.employee_id,
A.age,
(SELECT MAX(age)
FROM employee B
WHERE A.dep_id = B.dep_id) max_age
FROM employee A
ORDER BY 1,2
An example, borrowed from the Internet, clearly shows the distinction between DS and DF, implying by deduction that a Spark SQL correlated sub-query (not shown here, of course) also does not run against a DataSet:
sql("SELECT COUNT(*) FROM src").show()
val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")
val stringsDS = sqlDF.map {case Row(key: Int, value: String) => s"Key: $key, Value: $value"}
stringsDS.show()
The SQL runs against some source like Hive or Parquet, or against Spark temp views, not against a DS. From a DF you can go to the DS and then enjoy the more typesafe approach, but only with the limited interface on select. I did a thorough search for something that disproves this, but could not find it. DS and DF are sort of interchangeable anyway, as I think I stated to you earlier. But I see you are very thorough!
Moreover, there are at least 2 techniques for converting nested correlated subqueries to "normal" JOINs, which is what Spark and indeed other optimizers do in the background, e.g. RewriteCorrelatedScalarSubquery and PullupCorrelatedPredicate.
But with the DSL, which you allude to, you can re-write your query by hand to achieve the same result, using JOIN, LEFT JOIN, OUTER JOIN, whatever the case may be, although oddly enough that is not so obvious to everyone.
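For what it's worth, here is a sketch (untested) of a hand-written DSL equivalent of the correlated scalar subquery above, assuming a DataFrame named employee with the dep_id, employee_id and age columns used in the SQL:
import org.apache.spark.sql.functions.max

// Pre-aggregate the scalar subquery: maximum age per department
val maxAgePerDep = employee
  .groupBy("dep_id")
  .agg(max("age").as("max_age"))

// Join the aggregate back, mirroring the correlated subquery
val result = employee
  .join(maxAgePerDep, Seq("dep_id"), "left")
  .select("dep_id", "employee_id", "age", "max_age")
  .orderBy("dep_id", "employee_id")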
I recently played around with UDAFs and looked into the source code of the built-in aggregation function collect_list. I was surprised to see that collect_list does not have a merge method implemented, although I think this is really straightforward (just concatenate two Arrays). Code taken from org.apache.spark.sql.catalyst.expressions.aggregate.collect.Collect:
override def merge(buffer: InternalRow, input: InternalRow): Unit = {
sys.error("Collect cannot be used in partial aggregations.")
}
This is no longer the case, as of SPARK-1893, but I'd assume that the initial design had mostly collect_list in mind.
Because collect_list is logically equivalent to groupByKey, the motivation was exactly the same: to avoid long GC pauses. In particular, map-side combine in groupByKey was disabled in SPARK-772:
Map side combine in group by key case does not reduce the amount of data shuffled. Instead, it forces a lot more objects to go into old gen, and leads to worse GC.
So to address your comment:
I think this is really straightforward (just concatenate two Arrays).
It might be simple but it doesn't add much value (unless there is another reducing operation on top of it) and sequence concatenation is expensive.
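For completeness, a purely hypothetical sketch of what such a merge could look like (this is not the actual Spark implementation), assuming each partial buffer is simply an ArrayBuffer[Any]:
import scala.collection.mutable.ArrayBuffer

// Conceptually trivial: append one partial buffer to the other.
// Every merge still copies the whole input buffer, and the buffers keep growing.
def merge(buffer: ArrayBuffer[Any], input: ArrayBuffer[Any]): ArrayBuffer[Any] =
  buffer ++= input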
I am currently working with PySpark. There is no map function on DataFrame, and one has to go to RDD for map. In Scala there is a map on DataFrame; is there any reason for this?
Dataset.map is not part of the DataFrame (Dataset[Row]) API. It transforms strongly typed Dataset[T] into strongly typed Dataset[U]:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
and there is simply no place for Python in the strongly typed Dataset world. In general, Datasets are native JVM objects (unlike RDDs, they have no Python-specific implementation) which depend heavily on the rich Scala type system (even the Java API is severely limited). Even if Python implemented some variant of the Encoder API, data would still have to be converted to an RDD for computations.
In contrast, Python implements its own map-like mechanism with vectorized UDFs, which should be released in Spark 2.3. It is focused on a high-performance serde implementation coupled with the Pandas API.
That includes both typical UDFs (in particular the SCALAR and SCALAR_ITER variants) as well as map-like variants - GROUPED_MAP and MAP_ITER, applied through GroupedData.apply and DataFrame.mapInPandas (Spark >= 3.0.0) respectively.
After reading a few great articles (this, this and this) about Spark's Datasets, I ended up with the following list of Dataset performance benefits over RDD:
Logical and physical plan optimization;
Strict typing;
Vectorized operations;
Low level memory management.
Questions:
Spark's RDD also builds a physical plan and can combine/optimize multiple transformations in the same stage. Then what is the benefit of Dataset over RDD?
From the first link you can see an example of RDD[Person]. Does Dataset have more advanced typing?
What do they mean by "vectorized operations"?
As I understand it, Dataset's low-level memory management = advanced serialization. That means off-heap storage of serializable objects, where you can read a single field of an object without deserializing it. But what about the situation where you use the MEMORY_ONLY persistence strategy? Will Dataset serialize everything in any case? Will it have any performance benefit over RDD?
Spark's RDD also builds a physical plan and can combine/optimize multiple transformations in the same stage. Then what is the benefit of Dataset over RDD?
When working with an RDD, what you write is what you get. While certain transformations are optimized by chaining, the execution plan is a direct translation of the DAG. For example:
rdd.mapPartitions(f).mapPartitions(g).mapPartitions(h).shuffle()
where shuffle is an arbitrary shuffling transformation (*byKey, repartition, etc.), all three mapPartitions calls (map, flatMap, filter) will be chained without creating intermediate objects, but they cannot be rearranged.
Compared to that, Datasets use a significantly more restrictive programming model but can optimize execution using a number of techniques, including:
Selection (filter) pushdown. For example, the following:
df.withColumn("foo", col("bar") + 1).where(col("bar").isNotNull())
can be executed as:
df.where(col("bar").isNotNull()).withColumn("foo", col("bar") + 1)
Early projections (select) and eliminations. For example:
df.withColumn("foo", col("bar") + 1).select("foo", "bar")
can be rewritten as:
df.select("foo", "bar").withColumn("foo", col("bar") + 1)
to avoid fetching and passing obsolete data. In the extreme case it can eliminate a particular transformation completely:
df.withColumn("foo", col("bar") + 1).select("bar")
can be optimized to
df.select("bar")
These optimizations are possible for two reasons:
Restrictive data model which enables dependency analysis without complex and unreliable static code analysis.
Clear operator semantics. Operators are side effects free and we clearly distinguish between deterministic and nondeterministic ones.
To make it clear, let's say we have the following data model:
case class Person(name: String, surname: String, age: Int)
val people: RDD[Person] = ???
And we want to retrieve surnames of all people older than 21. With RDD it can be expressed as:
people
.map(p => (p.surname, p.age)) // f
.filter { case (_, age) => age > 21 } // g
Now let's ask ourselves a few questions:
What is the relationship between the input age in f and the age variable in g?
Is f and then g the same as g and then f?
Are f and g side effects free?
While the answers are obvious to a human reader, they are not to a hypothetical optimizer. Compared to that, with the DataFrame version:
people.toDF
.select(col("surname"), col("age")) // f'
.where(col("age") > 21) // g'
the answers are clear for both optimizer and human reader.
This has some further consequences when using statically typed Datasets (Spark 2.0 Dataset vs DataFrame).
Does Dataset have more advanced typing?
No - if you care about optimizations. The most advanced optimizations are limited to Dataset[Row], and at the moment it is not possible to encode a complex type hierarchy.
Maybe - if you accept the overhead of the Kryo or Java encoders.
What do they mean by "vectorized operations"?
In the context of optimization we usually mean loop vectorization / loop unrolling. Spark SQL uses code generation to create a compiler-friendly version of the high-level transformations, which can then be further optimized to take advantage of vectorized instruction sets.
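If you want to see what that looks like in practice, you can inspect the generated code from the shell; a minimal sketch, assuming a DataFrame df with a numeric column bar:
import org.apache.spark.sql.execution.debug._
import org.apache.spark.sql.functions.col

// Prints the whole-stage generated Java code for the query plan
df.withColumn("foo", col("bar") + 1).debugCodegen()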
As I understand it, Dataset's low-level memory management = advanced serialization.
Not exactly. The biggest advantage of using native allocation is escaping the garbage collector loop. Since garbage collection is quite often a limiting factor in Spark, this is a huge improvement, especially in contexts which require large data structures (like preparing shuffles).
Another important aspect is columnar storage which enables effective compression (potentially lower memory footprint) and optimized operations on compressed data.
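As a simple illustration (assuming an existing DataFrame df), caching a DataFrame stores it in Spark SQL's in-memory columnar format, which is what enables the compression described above:
val cached = df.cache() // stored as compressed in-memory columnar batches
cached.count()          // forces materialization of the cached representation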
In general you can apply exactly the same types of optimizations using hand-crafted code on plain RDDs. After all, Datasets are backed by RDDs. The difference is only how much effort it takes.
Hand crafted execution plan optimizations are relatively simple to achieve.
Making code compiler friendly requires some deeper knowledge and is error prone and verbose.
Using sun.misc.Unsafe with native memory allocation is not for the faint-hearted.
Despite all its merits, the Dataset API is not universal. While certain types of common tasks can benefit from its optimizations, in many contexts you may see no improvement whatsoever, or even a performance degradation, compared to the RDD equivalent.
I saw the following post a little bit back: Understanding TreeReduce in Spark
I am still trying to understand exactly when to use treeReduce vs reduceByKey. I think we can use a universal example like word count to help me further understand what is going on.
Does it always make sense to use reduceByKey in a word count?
Or is there a particular size of data when treeReduce makes more sense?
Are there particular cases or rules of thumbs when treeReduce is the better option?
Also, this may already be answered by the above regarding reduceByKey, but does anything change with reduceByKeyLocally and treeReduce?
How do I appropriately determine depth?
Edit: So, playing around in spark-shell, I think I fundamentally don't understand the concept of treeReduce, but hopefully an example and those questions will help.
res2: Array[(String, Int)] = Array((D,1), (18964,1), (D,1), (1,1), ("",1), ("",1), ("",1), ("",1), ("",1), (1,1))
scala> val reduce = input.reduceByKey(_+_)
reduce: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[11] at reduceByKey at <console>:25
scala> val tree = input.treeReduce(_+_, 2)
<console>:25: error: type mismatch;
found : (String, Int)
required: String
val tree = input.treeReduce(_+_, 2)
There is a fundamental difference between the two: reduceByKey is only available on key-value pair RDDs, while treeReduce is a generalization of the reduce operation on any RDD. reduceByKey is used for implementing treeReduce, but they are not related in any other sense.
reduceByKey performs a reduction per key, resulting in an RDD; it is not an "action" in the RDD sense but a transformation that returns a ShuffledRDD. This is equivalent to groupByKey followed by a map that does a key-wise reduction (check this for why using groupByKey is inefficient).
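A conceptual sketch of that equivalence, assuming pairs: RDD[(String, Int)] (the groupByKey form is shown only for illustration; reduceByKey combines map-side before the shuffle and is more efficient):
val viaGroupByKey  = pairs.groupByKey().mapValues(_.sum) // shuffles all values, then reduces per key
val viaReduceByKey = pairs.reduceByKey(_ + _)            // combines map-side, shuffles partial sums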
On the other hand, treeReduce (or more generally treeAggregate) is a generalization of the reduce function, inspired by AllReduce. This is an "action" in the Spark sense, returning the result on the master node. As explained in the link posted in your question, after performing the local reduce operation, reduce performs the rest of the computation on the master, which can be very burdensome (especially in machine learning when the reduce function results in large vectors or matrices). Instead, treeReduce performs the reduction in parallel using reduceByKey (this is done by creating a key-value pair RDD on the fly, with the keys determined by the depth of the tree; check the implementation here).
So, to answer your first two questions, you have to use reduceByKey for word count, since you are interested in getting per-word counts, and treeReduce is not appropriate here. The other two questions are not related to this topic.
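To make the first point concrete, a minimal word-count sketch contrasting the two, assuming lines: RDD[String]:
val wordCounts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)           // per-word counts, stays distributed as an RDD

val totalWords = lines
  .flatMap(_.split("\\s+"))
  .map(_ => 1L)
  .treeReduce(_ + _, depth = 2) // a single global sum, returned to the driver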