Understanding RDD and DataSet - apache-spark

From the DataSet and RDD documentation,
DataSet:
A Dataset is a strongly typed collection of domain-specific objects
that can be transformed in parallel using functional or relational
operations. Each dataset also has an untyped view called a DataFrame,
which is a Dataset of Row
RDD:
RDD represents an immutable,partitioned collection of elements that
can be operated on in parallel
The difference between them is also described as follows:
The major difference is, dataset is collection of domain specific
objects where as RDD is collection of any object. Domain object part
of definition signifies the schema part of dataset. So dataset API is
always strongly typed and optimized using schema where RDD is not.
I have two questions here:
What does it mean that a dataset is a collection of domain-specific objects while an RDD is a collection of any object? Given a case class Person, I thought DataSet[Person] and RDD[Person] are both collections of domain-specific objects.
Why is it said that the dataset API is always strongly typed and optimized using a schema, while RDD is not? I thought RDD[Person] is also strongly typed.

A strongly typed Dataset (not a DataFrame) is a collection of record types (Scala Products) which are mapped to an internal storage format using so-called Encoders, while an RDD can store arbitrary serializable objects (Serializable or Kryo-serializable). Therefore, as a container, RDD is much more generic than Dataset.
The following:
So dataset API is always strongly typed (...) where RDD is not.
is utterly absurd, and shows that you shouldn't trust everything you find on the Internet. In general, the Dataset API has significantly weaker type protections than RDD. This is particularly obvious when working with Dataset[Row], but it applies to any Dataset.
Consider, for example, the following:
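import org.apache.spark.sql.functions.array
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell
case class FooBar(id: Int, foos: Seq[Int])
Seq[(Integer, Integer)]((1, null))
  .toDF.select($"_1" as "id", array($"_2") as "foos")
  .as[FooBar]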
which clearly breaks type safety.

Related

Spark incorrectly converts a dataset into a dataset of JSON string

I've come across an odd behavior of Apache Spark.
The problem is that I get a wrong JSON representation of my source dataset when I use the toJson() method.
To explain the problem in more detail, imagine I have a typed dataset with these fields:
SomeObject
(
adtp
date
deviceType
...
)
Then I want to map the elements of this dataset to JSON using the toJson() method (for storing the objects in a Kafka topic).
But Spark converts these objects into their JSON representation incorrectly.
You can see this behaviour in the screenshots:
Before using toJson(), the object values were:
SomeObject
(
adtp=1
date="2019-04-24"
deviceType="Mobile"
...
)
After using toJson(), the values of the object are:
SomeObject
(
adtp=10
date="Mobile"
deviceType=""
...
)
Can you help me with this sort of problem? I tried to debug the Spark job, but it's not an easy task (I'm not an expert in Scala).
Finally I found out the cause of the problem. I have some JOINs in my data transformations, and then I make my dataset typed (using as(...)).
But the problem is that Spark doesn't change the internal schema of the dataset after typing.
And these schemas (that of the source dataset and that of the data model class) may differ, not only in which columns are present but also in their order.
So when it comes to converting the source dataset to a dataset of JSON strings, Spark just takes the schema left over after the JOINs and uses it for the conversion. This is the cause of the wrong toJson() output.
So the solution is quite simple: use one of the dataset transformation functions (map(...), for example) to explicitly update the dataset schema. In my case it looks pretty awful, but the most important thing is that it works:
.as(Encoders.bean(SomeObject.class))
.map(
(MapFunction<SomeObject, SomeObject>) obj -> obj,
Encoders.bean(SomeObject.class)
);
There is also a ticket on this problem: SPARK-17694.

If dataframes in Spark are immutable, why are we able to modify it with operations such as withColumn()?

This is probably a stupid question originating from my ignorance. I have been working on PySpark for a few weeks now and do not have much programming experience to start with.
My understanding is that in Spark, RDDs, Dataframes, and Datasets are all immutable - which, again I understand, means you cannot change the data. If so, why are we able to edit a Dataframe's existing column using withColumn()?
As per the Spark architecture, a DataFrame is built on top of RDDs, which are immutable, hence DataFrames are immutable as well.
Regarding withColumn, or any other operation for that matter: when you apply such an operation to a DataFrame, it generates a new DataFrame instead of updating the existing one.
However, Python (a dynamically typed language) lets you rebind a name to a new object, overwriting the previous reference. Hence, when you execute the statement below
df = df.withColumn()
it generates another DataFrame and assigns it to the name "df".
To verify this, you can call the id() method of the underlying rdd to get a unique identifier for your DataFrame.
df.rdd.id()
will give you a unique identifier for your DataFrame.
I hope the above explanation helps.
Regards,
Neeraj
You aren't; the documentation explicitly says
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
If you keep a variable referring to the dataframe you called withColumn on, it won't have the new column.
The core data structure of Spark, the RDD, is itself immutable. This is much like a String in Java, which is immutable as well.
When you concatenate a string with another literal, you are not modifying the original string; you are creating a new one altogether.
Similarly, for a DataFrame or a Dataset: whenever you alter it by adding or dropping a column, you are not changing anything in the original; instead, you are creating a new Dataset/DataFrame.
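The rebinding behaviour described in these answers can be sketched without Spark at all. The block below is a toy, pure-Python stand-in: the Frame class and its with_column method are hypothetical names invented for illustration and are not part of the Spark API. It only shows the general pattern of an immutable object whose "modifying" method returns a new object, while the variable name is rebound.

```python
from dataclasses import dataclass
from typing import Tuple

Column = Tuple[str, Tuple[int, ...]]


@dataclass(frozen=True)
class Frame:
    """Hypothetical immutable stand-in for a DataFrame (not Spark API)."""

    columns: Tuple[Column, ...]

    def with_column(self, name: str, values: Tuple[int, ...]) -> "Frame":
        # Build a brand-new Frame; the receiver is never modified.
        merged = dict(self.columns)
        merged[name] = values
        return Frame(tuple(merged.items()))

    def column_names(self) -> Tuple[str, ...]:
        return tuple(name for name, _ in self.columns)


df = Frame((("a", (1, 2, 3)),))
original = df                        # second reference to the first object
df = df.with_column("b", (4, 5, 6))  # rebinds the name df to a new object

print(original.column_names())  # ('a',)
print(df.column_names())        # ('a', 'b')
print(df is original)           # False
```

The first object is untouched; only the name df now points somewhere else, which is why code that kept the old reference does not see the new column.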

Performance benefits of DataSet over RDD

After reading a few great articles (this, this and this) about Spark's Datasets, I ended up with the following list of Dataset performance benefits over RDD:
Logical and physical plan optimization;
Strict typing;
Vectorized operations;
Low-level memory management.
Questions:
Spark's RDD also builds a physical plan and can combine/optimize multiple transformations in the same stage. Then what is the benefit of Dataset over RDD?
In the first link you can see an example of RDD[Person]. Does Dataset have more advanced typing?
What do they mean by "vectorized operations"?
As I understand, Dataset's low-level memory management = advanced serialization. That means off-heap storage of serializable objects, where you can read only one field of an object without deserialization. But what about the situation when you have the MEMORY_ONLY persistence strategy? Will Dataset serialize everything in any case? Will it have any performance benefit over RDD?
Spark's RDD also builds a physical plan and can combine/optimize multiple transformations in the same stage. Then what is the benefit of Dataset over RDD?
When working with RDD, what you write is what you get. While certain transformations are optimized by chaining, the execution plan is a direct translation of the DAG. For example:
rdd.mapPartitions(f).mapPartitions(g).mapPartitions(h).shuffle()
where shuffle is an arbitrary shuffling transformation (*byKey, repartition, etc.): all three mapPartitions calls (or map, flatMap, filter) will be chained without creating intermediate objects, but they cannot be rearranged.
Compared to that, Datasets use a significantly more restrictive programming model but can optimize execution using a number of techniques, including:
Selection (filter) pushdown. For example if you have:
df.withColumn("foo", col("bar") + 1).where(col("bar").isNotNull())
can be executed as:
df.where(col("bar").isNotNull()).withColumn("foo", col("bar") + 1)
Early projections (select) and eliminations. For example:
df.withColumn("foo", col("bar") + 1).select("foo", "bar")
can be rewritten as:
df.select("bar").withColumn("foo", col("bar") + 1).select("foo", "bar")
to avoid fetching and passing obsolete data. In the extreme case it can eliminate a particular transformation completely:
df.withColumn("foo", col("bar") + 1).select("bar")
can be optimized to
df.select("bar")
These optimizations are possible for two reasons:
Restrictive data model which enables dependency analysis without complex and unreliable static code analysis.
Clear operator semantics. Operators are free of side effects, and we clearly distinguish between deterministic and nondeterministic ones.
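To see why the filter pushdown rewrite is safe when operators have no side effects, here is a small pure-Python sketch on a list of dicts. It is not Spark code: add_foo, bar_not_null and the calls counter are hypothetical helpers that mimic withColumn("foo", col("bar") + 1) and where(col("bar").isNotNull()). Both orderings produce the same rows, but the pushed-down version evaluates the expression on fewer rows.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[int]]

rows: List[Row] = [{"bar": 1}, {"bar": None}, {"bar": 5}]

# Counts how often the "withColumn" expression is evaluated in each plan.
calls = {"naive": 0, "pushed": 0}


def add_foo(row: Row, plan: str) -> Row:
    # Mimics withColumn("foo", col("bar") + 1): null in, null out.
    calls[plan] += 1
    bar = row["bar"]
    return {**row, "foo": None if bar is None else bar + 1}


def bar_not_null(row: Row) -> bool:
    # Mimics where(col("bar").isNotNull()); depends only on "bar".
    return row["bar"] is not None


# Plan as written: compute foo for every row, then filter.
naive = [r for r in (add_foo(r, "naive") for r in rows) if bar_not_null(r)]

# Plan after pushdown: filter first, then compute foo for the survivors.
pushed = [add_foo(r, "pushed") for r in rows if bar_not_null(r)]

print(naive == pushed)  # True
print(calls)            # {'naive': 3, 'pushed': 2}
```

The rewrite is valid here only because the predicate reads a column the projection does not change and neither function has side effects, which is exactly the information an optimizer cannot extract from an arbitrary closure.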
To make it clear, let's say we have the following data model:
case class Person(name: String, surname: String, age: Int)
val people: RDD[Person] = ???
And we want to retrieve surnames of all people older than 21. With RDD it can be expressed as:
people
.map(p => (p.surname, p.age)) // f
.filter { case (_, age) => age > 21 } // g
Now let's ask ourselves a few questions:
What is the relationship between the input age in f and the age variable in g?
Is f and then g the same as g and then f?
Are f and g free of side effects?
While the answer is obvious to a human reader, it is not to a hypothetical optimizer. Compare that with the DataFrame version:
people.toDF
.select(col("surname"), col("age")) // f'
.where(col("age") > 21) // g'
the answers are clear for both optimizer and human reader.
This has some further consequences when using statically typed Datasets (Spark 2.0 Dataset vs DataFrame).
Does Dataset have more advanced typing?
No - if you care about optimizations. The most advanced optimizations are limited to Dataset[Row], and at this moment it is not possible to encode a complex type hierarchy.
Maybe - if you accept the overhead of the Kryo or Java encoders.
What do they mean by "vectorized operations"?
In the context of optimization, we usually mean loop vectorization / loop unrolling. Spark SQL uses code generation to create a compiler-friendly version of the high-level transformations, which can be further optimized to take advantage of vectorized instruction sets.
As I understand, DataSet's low memory management = advanced serialization.
Not exactly. The biggest advantage of using native allocation is escaping the garbage collector loop. Since garbage collection is quite often a limiting factor in Spark, this is a huge improvement, especially in contexts which require large data structures (like preparing shuffles).
Another important aspect is columnar storage, which enables effective compression (potentially a lower memory footprint) and optimized operations on compressed data.
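As a rough illustration of why a columnar layout helps, the pure-Python sketch below stores a few rows column by column, compresses each column with run-length encoding, and computes a sum directly on the compressed form. This is only a toy: the sample values and the rle helper are made up for the example, and it does not reproduce Spark's actual in-memory format or codecs.

```python
from itertools import groupby
from typing import List, Tuple

# Hypothetical sample rows: (deviceType, adtp)
rows = [
    ("Mobile", 1),
    ("Mobile", 1),
    ("Mobile", 2),
    ("Desktop", 2),
    ("Desktop", 2),
]

# Row layout -> column layout: values of one field sit next to each other.
device_col = [device for device, _ in rows]
adtp_col = [adtp for _, adtp in rows]


def rle(column: list) -> List[Tuple[object, int]]:
    # Run-length encoding: collapse each run of equal values to (value, count).
    return [(value, len(list(run))) for value, run in groupby(column)]


device_rle = rle(device_col)
adtp_rle = rle(adtp_col)

# An aggregate computed on the compressed column, without expanding it.
total = sum(value * count for value, count in adtp_rle)

print(device_rle)  # [('Mobile', 3), ('Desktop', 2)]
print(adtp_rle)    # [(1, 2), (2, 3)]
print(total)       # 8
```

Runs of repeated values only appear once the values of a single field are stored together, which is why this kind of compression is much less effective on a collection of whole row objects.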
In general, you can apply exactly the same types of optimizations using hand-crafted code on plain RDDs. After all, Datasets are backed by RDDs. The difference is only in how much effort it takes.
Hand-crafted execution plan optimizations are relatively simple to achieve.
Making code compiler-friendly requires some deeper knowledge and is error-prone and verbose.
Using sun.misc.Unsafe with native memory allocation is not for the faint-hearted.
Despite all its merits, the Dataset API is not universal. While certain types of common tasks can benefit from its optimizations, in many contexts you may see no improvement whatsoever, or even performance degradation, compared to the RDD equivalent.

Pyspark wrapper for H2O POJO

I created a model using H2O's Sparkling Water, and now I'd like to apply it to a huge Spark DataFrame (populated with sparse vectors). I use Python with pyspark and pysparkling. Basically, I need to do a map job with the model.predict() function inside. But copying the data into the H2O context is a huge overhead and not an option. What I think I'm going to do is extract the POJO (Java class) model from the H2O model and use it to do the map over the DataFrame. My questions are:
Is there a better way?
How do I write a pyspark wrapper for a Java class, from which I intend to use only one method, .score(double[] data, double[] result)?
How can I maximally reuse wrappers from the Spark ML library?
Thank you!
In this case, you can:
1) use the h2o.predict(H2OFrame) method to generate predictions, but you need to transform the RDD to an H2OFrame. It is not a perfect solution; however, in some cases it can be a reasonable one.
2) switch to the JVM and call it directly via Spark's Py4J gateway.
This is not a fully working solution right now, since the method score0 needs to accept non-primitive types on the H2O side and also to be visible (right now it is protected),
but here is at least the idea:
model = sc._jvm.water.DKV.getGet("deeplearning.model")
double_class = sc._jvm.double
row = sc._gateway.new_array(double_class, nfeatures)
row[0] = ...
...
row[nfeatures-1] = ...
prediction = model.score0(row)
I created a JIRA improvement for this case: https://0xdata.atlassian.net/browse/PUBDEV-2726
However, a workaround is to create a Java wrapper around the model which would expose the right shape of the score0 function:
class ModelWrapper extends Model {
  public double[] score(double[] row) {
    return score0(row);
  }
}
Please see also hex.ModelUtils: https://github.com/h2oai/sparkling-water/blob/master/core/src/main/scala/hex/ModelUtils.scala
(again you can call them directly via Py4J gateway exposed by Spark)

Mind blown: RDD.zip() method

I just discovered the RDD.zip() method and I cannot imagine what its contract could possibly be.
I understand what it does, of course. However, it has always been my understanding that
the order of elements in an RDD is a meaningless concept
the number of partitions and their sizes is an implementation detail only available to the user for performance tuning
In other words, an RDD is a (multi)set, not a sequence (and, of course, in, e.g., Python one gets AttributeError: 'set' object has no attribute 'zip')
What is wrong with my understanding above?
What was the rationale behind this method?
Is it legal outside the trivial context like a.map(f).zip(a)?
EDIT 1:
Another crazy method is zipWithIndex(), as well as the various zipPartitions() variants.
Note that first() and take() are not crazy because they are just (non-random) samples of the RDD.
collect() is also okay - it just converts a set to a sequence which is perfectly legit.
EDIT 2: The reply says:
when you compute one RDD from another the order of elements in the new RDD may not correspond to that in the old one.
This appears to imply that even the trivial a.map(f).zip(a) is not guaranteed to be equivalent to a.map(x => (f(x),x)). What is the situation when zip() results are reproducible?
It is not true that RDDs are always unordered. An RDD has a guaranteed order if it is the result of a sortBy operation, for example. An RDD is not a set; it can contain duplicates. Partitioning is not opaque to the caller, and can be controlled and queried. Many operations, like map, do preserve both partitioning and order. That said, I find it a little too easy to accidentally violate the assumptions that zip depends on, since they're a little subtle, but it certainly has a purpose.
The mental model I use (and recommend) is that the elements of an RDD are ordered, but when you compute one RDD from another the order of elements in the new RDD may not correspond to that in the old one.
For those who want to be aware of partitions, I'd say that:
The partitions of an RDD have an order.
The elements within a partition have an order.
If you think of "concatenating" the partitions (say laying them "end to end" in order) using the order of elements within them, the overall ordering you end up with corresponds to the order of elements if you ignore partitions.
But again, if you compute one RDD from another, all bets about the order relationships of the two RDDs are off.
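The partition-aware model above can be made concrete with a small pure-Python sketch, in which an "RDD" is just a list of partitions and each partition is a list. The helpers map_partitions, zip_partitions and flatten are hypothetical names for this toy, not Spark API. The size checks reflect the assumption stated in the documentation of RDD.zip: both RDDs must have the same number of partitions and the same number of elements in each partition.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def map_partitions(parts: List[List[T]], f: Callable[[T], U]) -> List[List[U]]:
    # An element-wise map keeps partition boundaries and element order.
    return [[f(x) for x in part] for part in parts]


def zip_partitions(left: List[List[T]], right: List[List[U]]) -> List[List[Tuple[T, U]]]:
    # Pair elements partition by partition, position by position.
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    zipped = []
    for l_part, r_part in zip(left, right):
        if len(l_part) != len(r_part):
            raise ValueError("partitions differ in size")
        zipped.append(list(zip(l_part, r_part)))
    return zipped


def flatten(parts: List[List[T]]) -> List[T]:
    # "Concatenate" the partitions end to end, in partition order.
    return [x for part in parts for x in part]


a = [[1, 2], [3], [4, 5, 6]]
squared = map_partitions(a, lambda x: x * x)

# The trivial case from the question: a.map(f).zip(a)
pairs = flatten(zip_partitions(squared, a))
direct = [(x * x, x) for x in flatten(a)]
print(pairs == direct)  # True

# Same elements, different partitioning: the zip assumptions no longer hold.
repartitioned = [[1, 2, 3], [4, 5, 6]]
try:
    zip_partitions(a, repartitioned)
    zip_failed = False
except ValueError:
    zip_failed = True
print(zip_failed)  # True
```

In this toy, zip is well defined exactly when one side was derived from the other by an operation that preserves partitioning and order; anything that repartitions or reorders breaks the pairing.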
Several members of the RDD class (I'm referring to the Scala API) strongly suggest an order concept (as does their documentation):
collect()
first()
partitions
take()
zipWithIndex()
as does Partition.index as well as SparkContext.parallelize() and SparkContext.makeRDD() (which both take a Seq[T]).
In my experience these ways of "observing" order give results that are consistent with each other, and the ones that translate back and forth between RDDs and ordered Scala collections behave as you would expect -- they preserve the overall order of elements. This is why I say that, in practice, RDDs have a meaningful order concept.
Furthermore, while there are obviously many situations where computing an RDD from another must change the order, in my experience order tends to be preserved where it is possible/reasonable to do so. Operations that don't re-partition and don't fundamentally change the set of elements especially tend to preserve order.
But this brings me to your question about "contract", and indeed the documentation has a problem in this regard. I have not seen a single place where an operation's effect on element order is made clear. (The OrderedRDDFunctions class doesn't count, because it refers to an ordering based on the data, which may differ from the raw order of elements within the RDD. Likewise the RangePartitioner class.) I can see how this might lead you to conclude that there is no concept of element order, but the examples I've given above make that model unsatisfying to me.

Resources