Dataset predicate pushdown after .as(Encoders.kryo) - apache-spark

Help me please to write an optimal spark query. I have read about predicate pushdown:
When you execute where or filter operators right after loading a
dataset, Spark SQL will try to push the where/filter predicate down to
the data source using a corresponding SQL query with WHERE clause (or
whatever the proper language for the data source is).
Will predicate pushdown work after the .as(Encoders.kryo(MyObject.class)) operation?
spark
.read()
.parquet(params.getMyObjectsPath())
// As I understand predicate pushdown will work here
// But I should construct MyObject from org.apache.spark.sql.Row manually
.as(Encoders.kryo(MyObject.class))
// QUESTION: will predicate pushdown work here as well?
.collectAsList();

It won't work. Once you use Encoders.kryo you get just a binary blob, which doesn't benefit from columnar storage and doesn't provide efficient access to individual fields (i.e. without deserializing the whole object), let alone predicate pushdown or more advanced optimizations.
You would be better off with Encoders.bean, if the MyObject class allows for it. In general, to get the full advantage of Dataset optimizations you need at least a type that can be encoded with a more specific encoder than Kryo.
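For comparison, here is a minimal Scala sketch of the suggested approach, using a case class (and hence a product encoder; Encoders.bean plays the same role for a Java bean). The field names and the path are made up for illustration. The point is that a column-based filter on such a Dataset is still eligible for pushdown to the Parquet scan, which you can check by looking for PushedFilters in the physical plan:
import org.apache.spark.sql.SparkSession

// Hypothetical domain class; for a Java bean use Encoders.bean(MyObject.class) instead
case class MyObject(id: Long, name: String)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val ds = spark
  .read
  .parquet("/path/to/my-objects")   // hypothetical path
  .as[MyObject]                     // a specific encoder keeps the columnar schema

// Column-based filter: eligible for predicate pushdown to the Parquet reader
val filtered = ds.filter($"id" > 100)
filtered.explain()                  // look for "PushedFilters: [...]" on the FileScan node
val result = filtered.collect()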
Related: Spark 2.0 Dataset vs DataFrame

Related

Understanding pandas_udf

The documentation page of pandas_udf in pyspark documentation has the following paragraph:
The user-defined functions do not support conditional expressions or short circuiting in boolean expressions and it ends up with being executed all internally. If the functions can fail on special rows, the workaround is to incorporate the condition into the functions.
Can somebody explain to me what this means? It seems it is saying that the UDF does not support conditional statements (if else blocks) and then suggesting that the workaround is to include the if else condition in the function body. This does not make sense to me. Please help
I read something similar in Learning Spark - Lightning-Fast Data Analytics
In Chapter 5 - User Defined Functions it talks about evaluation order and null checking in Spark SQL.
If your UDF can fail when dealing with NULL values, it's best to move that logic inside the UDF itself, just as the quote you provided says.
Here's the reasoning behind it:
Spark SQL (this includes the DataFrame API and the Dataset API) does not guarantee the order of evaluation of subexpressions. For example, the following query does not guarantee that the s IS NOT NULL clause is evaluated prior to strlen(s):
spark.sql("SELECT s FROM test1 WHERE s IS NOT NULL AND strlen(s) > 1")
Therefore, to perform proper null checking, it is recommended that you make the UDF itself null-aware and do the null checking inside the UDF.
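As a minimal sketch of that pattern (in Scala, mirroring the book's strlen example; the strlen_safe name is made up), the null check lives inside the UDF body, so the evaluation order of the surrounding predicates no longer matters:
// Null-aware UDF: the null check happens inside the function itself
spark.udf.register("strlen_safe", (s: String) => if (s != null) s.length else 0)

// Safe regardless of which predicate Spark evaluates first
spark.sql("SELECT s FROM test1 WHERE s IS NOT NULL AND strlen_safe(s) > 1")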

What is the purpose of data types in (Py)Spark?

PySpark offers various data types; however, there do not seem to be any useful methods we can call on those types. For example, ArrayType does not even have insert, remove, or find methods.
Why this lack of methods? What is the purpose of data types in Spark if we can do nothing with them? How does Spark handle these types internally?
Spark's types are not like objects in ordinary programming languages. They exist for serialization purposes: they describe a schema that lets Spark store the data in any of the formats it supports (JSON, Parquet, ORC, CSV and so on) and preserve the types when you write to your storage.
To get more type-aware handling, Spark (in Scala) lets you use Datasets, where you define your types with a case class and then work with ordinary primitive types in your code.
import spark.implicits._
case class MyData(str1: String, int1: Int, arr1: Array[String])
spark.read.table("my_table").as[MyData]
For PySpark this is a little more complicated, but you don't need to worry about serialization.
If you need to manipulate typed columns in PySpark, you can use the SQL functions from pyspark.sql.functions.
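As a small sketch of what that looks like in practice (Scala here, reusing the my_table and arr1 names from the example above; the same functions exist in pyspark.sql.functions, and array_remove and array concatenation require Spark 2.4+), you derive new columns with the built-in array functions instead of mutating an ArrayType value:
import org.apache.spark.sql.functions._

val df = spark.read.table("my_table")

df.select(
  array_contains(col("arr1"), "x").as("has_x"),           // "find"
  array_remove(col("arr1"), "x").as("arr_without_x"),     // "remove"
  concat(col("arr1"), array(lit("y"))).as("arr_plus_y")   // "insert" (append)
)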

Spark incorrectly converts a dataset into a dataset of JSON string

I've come across an odd behavior of Apache Spark.
The problem is that I get a wrong JSON representation of my source dataset when I use the toJson() method.
To explain the problem in more detail, imagine I have a typed dataset with these fields:
SomeObject
(
adtp
date
deviceType
...
)
Then I want to map the elements of this dataset to JSON using the toJson() method (for storing the objects in a Kafka topic).
But Spark converts these objects into their JSON representation incorrectly.
You can see this behaviour below:
Before using toJson(), the object values were:
SomeObject
(
adtp=1
date="2019-04-24"
deviceType="Mobile"
...
)
After using toJson(), the values of the object are:
SomeObject
(
adtp=10
date="Mobile"
deviceType=""
...
)
Can you help me with this sort of problem? I tried to debug the Spark job, but it's not an easy task (I'm not an expert in Scala).
Finally I found out the cause of the problem. I have some JOINs in my data transformations, and afterwards I make my dataset typed (using as(...)).
But the problem is that Spark doesn't change the internal schema of the dataset after typing it.
And these schemas (that of the source dataset and that of the data-model class) may differ, not only in which columns are present but also in their order.
So when it comes to converting the source dataset to a dataset of JSON strings, Spark just takes the schema left over from the JOINs and uses it for the JSON conversion. This is the cause of the wrong toJson() output.
So the solution is quite simple: just use one of the dataset transformation functions (map(...), for example) to explicitly update the dataset schema. In my case it looks pretty awkward, but the most important thing is that it works:
.as(Encoders.bean(SomeObject.class))
.map(
    // identity map: forces Spark to rebuild the schema from the bean encoder
    (MapFunction<SomeObject, SomeObject>) obj -> obj,
    Encoders.bean(SomeObject.class)
);
There is also a ticket on this problem: SPARK-17694.

serialize RDD with Avro

I have this scenario: we have to provide functionality that takes an RDD of any type (in generics notation, RDD[T]) and serializes and saves it to HDFS as an Avro data file.
Beware that the RDD could hold anything, so the functionality should be generic over the given RDD type, for example RDD[(String, AnyBusinessObject)] or RDD[(String, Date, OtherBusinessObject)].
The question is: how can we infer the Avro schema and provide Avro serialization for an arbitrary class type in order to save it as an Avro data file?
The functionality is actually already built, but it uses Java serialization, which obviously incurs a space and time penalty, so we would like to refactor it. We can't use DataFrames.
You can write Avro files using the GenericRecord API (see the "Serializing and deserializing without code generation" section of the Avro documentation). However, you still need to have the Avro schema.
If you have a DataFrame, Spark handles all of this for you, because Spark knows how to convert from Spark SQL types to Avro types.
Since you say you can't use DataFrames, you'll have to do the schema generation yourself. One option is to use Avro's ReflectData API.
Then, once you have the schema, you do a map to transform all of the elements in the RDD to GenericRecords and use a GenericDatumWriter to write them to a file.
I'd seriously reconsider these requirements, though. IMO, a better design would be to convert from the RDD to a DataFrame so that you can let Spark do the heavy lifting of writing Avro. Or... why even bother with Avro? Just use a file format that allows a generic schema, like JSON.
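If you do stay with the RDD approach, here is a rough sketch of what the steps above look like, with a made-up BusinessObject type and a local output file standing in for HDFS (an HDFS OutputStream would replace the java.io.File in practice):
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.reflect.ReflectData

// Hypothetical domain type; real code would be generic in T
case class BusinessObject(id: String, amount: Double)

// 1. Infer the Avro schema from the class via reflection
val schema: Schema = ReflectData.get().getSchema(classOf[BusinessObject])

// 2. Map a domain object to a GenericRecord matching that schema
def toRecord(obj: BusinessObject): GenericRecord = {
  val rec = new GenericData.Record(schema)
  rec.put("id", obj.id)
  rec.put("amount", obj.amount)
  rec
}

// 3. Write one partition's records to an Avro data file
def writePartition(records: Iterator[BusinessObject], file: java.io.File): Unit = {
  val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
  writer.create(schema, file)
  try records.foreach(r => writer.append(toRecord(r)))
  finally writer.close()
}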

Understanding RDD and DataSet

From the DataSet and RDD documentation,
DataSet:
A Dataset is a strongly typed collection of domain-specific objects
that can be transformed in parallel using functional or relational
operations. Each dataset also has an untyped view called a DataFrame,
which is a Dataset of Row
RDD:
RDD represents an immutable, partitioned collection of elements that
can be operated on in parallel
Also, this is said about the difference between them:
The major difference is, dataset is collection of domain specific
objects where as RDD is collection of any object. Domain object part
of definition signifies the schema part of dataset. So dataset API is
always strongly typed and optimized using schema where RDD is not.
I have two questions here:
What does it mean that a dataset is a collection of domain-specific objects while an RDD is a collection of any object? Given a case class Person, I thought Dataset[Person] and RDD[Person] were both collections of domain-specific objects.
"dataset API is always strongly typed and optimized using schema where RDD is not" - why is it said that the dataset API is always strongly typed while an RDD is not? I thought RDD[Person] was also strongly typed.
A strongly typed Dataset (not DataFrame) is a collection of record types (Scala Products) which are mapped to the internal storage format using so-called Encoders, while an RDD can store arbitrary serializable objects (Java Serializable or Kryo-serializable). Therefore, as a container, an RDD is much more generic than a Dataset.
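As a small illustration of that difference (names are made up), constructing a Dataset requires an Encoder for the element type, which also gives it a schema, while an RDD accepts any serializable object with no schema at all:
import spark.implicits._

case class Person(name: String, age: Int)

// Dataset: the case class gets a product Encoder, so the Dataset has a schema
val ds = Seq(Person("Ann", 30)).toDS()
ds.printSchema()   // name: string, age: integer

// RDD: no Encoder and no schema; any serializable object will do
class Opaque(val payload: Array[Byte]) extends Serializable
val rdd = spark.sparkContext.parallelize(Seq(new Opaque(Array[Byte](1, 2, 3))))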
As for the following:
So dataset API is always strongly typed (...) where RDD is not.
it is utterly absurd, and shows that you shouldn't trust everything you can find on the Internet. In general, the Dataset API has significantly weaker type protection than RDD. This is particularly obvious when working with Dataset[Row], but it applies to any Dataset.
Consider, for example, the following:
case class FooBar(id: Int, foos: Seq[Int])
Seq[(Integer, Integer)]((1, null))
.toDF.select($"_1" as "id", array($"_2") as "foos")
.as[FooBar]
which clearly breaks type safety: a null ends up inside a field declared as Seq[Int].
