serialize RDD with Avro - apache-spark

I have this scenario. We have to provide functionality that takes an RDD of any type (RDD[T] in generics notation), serializes it, and saves it to HDFS as an Avro data file.
Note that the RDD could hold anything, so the functionality must be generic over the RDD's element type, for example RDD[(String, AnyBusinessObject)] or RDD[(String, Date, OtherBusinessObject)].
The question is: how can we infer the Avro schema and provide Avro serialization for an arbitrary class, so that we can save it as an Avro data file?
The functionality is actually already built, but it uses Java serialization, which carries an obvious space and time penalty, so we would like to refactor it. We can't use DataFrames.

You can write Avro files using the GenericRecord API (see the "Serializing and deserializing without code generation" section of the Avro documentation). However, you still need to have the Avro schema.
If you have a DataFrame, Spark handles all of this for you because Spark knows how to do the conversion from Spark SQL types to Avro types.
Since you say you can't use DataFrames, you'll have to do this schema generation yourself. One option is to use Avro's ReflectData API.
Then, once you have the schema, you'll do a map to transform all of the elements in the RDD to GenericRecords and use GenericDatumWriter to write them to a file.
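To make the idea of reflection-based schema generation concrete, here is a minimal sketch in plain Python. It is not Avro's ReflectData API (that is a Java API); it only illustrates how a record schema can be derived from a class definition. The `avro_schema_for` helper, the `Customer` and `Address` classes, and the type mapping are all assumptions made for illustration.

```python
import dataclasses
import json

# Assumed mapping from Python primitives to Avro primitive type names.
AVRO_PRIMITIVES = {
    str: "string",
    int: "long",
    float: "double",
    bool: "boolean",
    bytes: "bytes",
}


def avro_schema_for(cls):
    """Derive an Avro record schema (as a dict) from a dataclass by reflection."""
    if not dataclasses.is_dataclass(cls):
        raise TypeError(f"{cls!r} is not a dataclass")
    fields = []
    for field in dataclasses.fields(cls):
        if field.type in AVRO_PRIMITIVES:
            avro_type = AVRO_PRIMITIVES[field.type]
        elif dataclasses.is_dataclass(field.type):
            # Nested business objects become nested record schemas.
            avro_type = avro_schema_for(field.type)
        else:
            raise TypeError(f"unsupported field type: {field.type!r}")
        fields.append({"name": field.name, "type": avro_type})
    return {"type": "record", "name": cls.__name__, "fields": fields}


# Hypothetical business objects, used only for this example.
@dataclasses.dataclass
class Address:
    city: str
    zip_code: str


@dataclasses.dataclass
class Customer:
    name: str
    age: int
    address: Address


schema = avro_schema_for(Customer)
print(json.dumps(schema, indent=2))
```

The sketch handles only primitives and nested records; a real implementation would also need unions for nullable fields, arrays, maps, and dates, which is the work ReflectData does for you on the JVM.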
I'd seriously reconsider these requirements though. IMO, a better design would be to convert from an RDD to a DataFrame so that you can let Spark do the heavy lifting of writing Avro. Or... why even bother with Avro? Just use a file format that allows you to have a generic schema like JSON.

Related

How to handle incorrect parquet schema of some files while reading from HDFS directory?

I have the following directory structure in HDFS:
/HDFS/file/date=20200801/id=1
..
/HDFS/file/date=20200831/id=1
/HDFS/file/date=20200901/id=1
/HDFS/file/date=20200902/id=1
/HDFS/file/date=20200903/id=1
...
/HDFS/file/date=20200930/id=1
I am reading these files using
df = spark.read.option("mergeSchema", "true").parquet('/HDFS/file/')
The problem is that a couple of columns have the double type in some files, while the same columns have the string type in others.
Schema merging fails with an error saying that it cannot merge the double and string types.
Is there a way to handle the schema while reading the Parquet files, so that the problem columns are converted to string on read?
I think the best bet would be to describe the schema explicitly and use it to load the partially-incorrect dataset. The trick I'd use would be to choose the "widest" (the most forgiving) data type (e.g. string) and use it for the columns that are affected by this data incorrectness. Once the dataset is loaded, you could then have another pass over it to convert it to the expected type (e.g. double).
I don't know how to define a schema for 10+ fields in Python without much typing, as I work with Scala (where case classes work well with the Encoders utility).
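One way to avoid the typing in Python is to generate a DDL schema string from a list of column names, defaulting every column to the widest type. The sketch below covers only that step. The `build_ddl_schema` helper and the column names are hypothetical, and it assumes the names are plain identifiers that need no quoting.

```python
def build_ddl_schema(columns, default_type="STRING", overrides=None):
    """Build a DDL schema string such as 'a STRING, b INT' from column names."""
    overrides = overrides or {}
    return ", ".join(
        f"{name} {overrides.get(name, default_type)}" for name in columns
    )


# Hypothetical column names; replace with the real ones.
columns = ["customer", "amount", "price", "quantity"]

# Every column defaults to STRING; only known-good columns are overridden.
ddl = build_ddl_schema(columns, overrides={"quantity": "INT"})
print(ddl)

# The string can then be handed to the reader and the affected columns
# cast afterwards (not executed here, since it needs a Spark session):
#   df = spark.read.schema(ddl).parquet("/HDFS/file/")
#   df = df.withColumn("amount", df["amount"].cast("double"))
```

`DataFrameReader.schema` accepts such a DDL string in Spark 2.3 and later; on older versions a StructType would have to be built instead.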

What is the purpose of data types in (Py)Spark?

PySpark offers various data types, but there do not seem to be any useful methods we can call on them. For example, ArrayType does not even have insert, remove, or find methods.
Why do these types lack methods? What is the purpose of data types in Spark if we can do nothing with them? How does Spark handle those types internally?
Spark's types are not like objects in ordinary programming languages. They exist for serialization: they describe the data so that Spark can store it in any format it supports (JSON, Parquet, ORC, CSV, and so on) and keep the type information when you write to storage.
To get more ways to work with the data, Spark (Scala) lets you use Datasets, where you define your types with a case class. You can then work with the fields as ordinary Scala types.
import spark.implicits._
case class MyData(str1: String, int1: Int, arr1: Array[String])
spark.read.table("my_table").as[MyData]
For PySpark this is a little more complicated, but you don't need to worry about serialization.
If you need to manipulate typed columns in PySpark, you can use the SQL functions (pyspark.sql.functions).

Spark incorrectly converts a dataset into a dataset of JSON string

I've come across an odd behavior of Apache Spark.
The problem is that I get a wrong JSON representation of my source dataset when I use the toJson() method.
To explain the problem in more detail, imagine I have a typed dataset with these fields:
SomeObject
(
adtp
date
deviceType
...
)
Then I want to map the elements of this dataset to JSON using the toJson() method (to store the objects in a Kafka topic).
But Spark converts these objects into their JSON representation incorrectly.
Before using toJson(), the object values were:
SomeObject
(
adtp=1
date="2019-04-24"
deviceType="Mobile"
...
)
After using toJson(), the values of the object are:
SomeObject
(
adtp=10
date="Mobile"
deviceType=""
...
)
Can you help me with this sort of problem? I tried to debug the Spark job, but it's not an easy task (I'm not an expert in Scala).
Finally I found out the cause of the problem. I have some JOINs in my data transformations and then I make my dataset typed (using as(...)).
But the problem is that Spark doesn't change the internal schema of the dataset after typing.
And these two schemas (that of the source dataset and that of the data model class) may differ, not only in which columns are present but also in their order.
So when it comes to conversion of the source dataset to the dataset of JSONs, Spark just takes the schema remaining after the JOINs, and uses it when converting to JSON. And this is the cause of the wrong toJson() conversion.
So the solution is quite simple. Just use one of the transformation dataset functions (map(...) as an example) to explicitly update your dataset schema. So in my case it looks pretty awful but the most important thing is that it works:
.as(Encoders.bean(SomeObject.class))
.map(
(MapFunction<SomeObject, SomeObject>) obj -> obj,
Encoders.bean(SomeObject.class)
);
There is also a ticket on this problem: SPARK-17694.

Dataset predicate pushdown after .as(Encoders.kryo)

Please help me write an optimal Spark query. I have read about predicate pushdown:
When you execute where or filter operators right after loading a
dataset, Spark SQL will try to push the where/filter predicate down to
the data source using a corresponding SQL query with WHERE clause (or
whatever the proper language for the data source is).
Will predicate pushdown work after the .as(Encoders.kryo(MyObject.class)) operation?
spark
.read()
.parquet(params.getMyObjectsPath())
// As I understand predicate pushdown will work here
// But I should construct MyObject from org.apache.spark.sql.Row manually
.as(Encoders.kryo(MyObject.class))
// QUESTION: will predicate pushdown work here as well?
.collectAsList();
It won't work. After you use Encoders.kryo you get just a binary blob, which doesn't really benefit from columnar storage and doesn't provide efficient access to individual fields (that is, access without deserializing the object), not to mention predicate pushdown or more advanced optimizations.
You could be better off with Encoders.bean if the MyObject class allows for that. In general, to take full advantage of Dataset optimizations you need at least a type that can be encoded with a more specific encoder.
Related: Spark 2.0 Dataset vs DataFrame

pySpark DataFrame FloatType() with file coming in as unicode

Hello I have the following schema:
[StructField(record_id,StringType,true), StructField(offer_id,FloatType,true)]
The file I am importing is coming in as unicode.
With sc.textFile, setting use_unicode to False still raises a string error. My question is: before I load the data into the DataFrame, do I have to cleanse it (convert unicode to float) before declaring the column as FloatType?
What is the most efficient way to do this, especially as I scale to 1000s of fields?
It is NOT good practice to convert implicitly between unrelated data types, so (almost) no system will do it for you automagically. Yes, you have to tell the system explicitly, and in doing so you accept the risk of a future failure (what happens if the string field suddenly contains "abc"?).
You should use a map function as a translation layer between your sc.textFile call and the createDataFrame (or apply-schema) step. All casting to the correct data types should happen there.
If you have 1000s of fields, you may want to implement a schema-inference mechanism: take a sample of the data to decide which schema to use, then apply it to the whole dataset.
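A minimal sketch of such a translation layer for the two-field schema in the question, in plain Python. The comma delimiter and the choice to map unparseable values to None (the schema marks offer_id as nullable) are assumptions; `parse_line` and `to_float` are hypothetical helper names.

```python
def to_float(text):
    """Cast a string to float, returning None when it is not numeric."""
    try:
        return float(text)
    except ValueError:
        # Explicit decision for bad input such as "abc" or an empty string.
        return None


def parse_line(line, sep=","):
    """Turn one raw text line into a (record_id, offer_id) tuple."""
    record_id, offer_id = line.split(sep)
    return (record_id, to_float(offer_id))


print(parse_line("r1,12.5"))
print(parse_line("r2,abc"))

# With Spark this would sit between the two steps (not executed here):
#   rows = sc.textFile(path).map(parse_line)
#   df = sqlContext.createDataFrame(rows, schema)
```

Keeping the cast in one small function makes the failure policy visible: swapping `return None` for a `raise` turns silent nulls into a hard failure.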
(Assuming Spark 1.3.1 release)
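The schema-inference idea can be sketched in plain Python as follows. This is only an illustration under simple assumptions: the sample is already split into string fields, empty strings count as missing values, and only three types are considered. `infer_type` and `infer_schema` are hypothetical names.

```python
def infer_type(values):
    """Return the narrowest of 'int', 'float', 'string' that fits all values."""

    def fits(cast):
        for value in values:
            if value == "":
                continue  # treat empty strings as missing, not as evidence
            try:
                cast(value)
            except ValueError:
                return False
        return True

    if fits(int):
        return "int"
    if fits(float):
        return "float"
    return "string"


def infer_schema(sample_rows, names):
    """Infer a (name, type) pair per column from a sample of rows."""
    columns = list(zip(*sample_rows))  # transpose rows into columns
    return [(name, infer_type(col)) for name, col in zip(names, columns)]


# Hypothetical sample and column names.
sample = [["r1", "10", "1.5"], ["r2", "20", "abc"], ["r3", "", "2"]]
names = ["record_id", "offer_id", "score"]
print(infer_schema(sample, names))
```

A single non-numeric value in the sample is enough to widen a column to string, so the sample should be large enough to be representative; values outside the sample can still break the inferred schema later.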

Resources