JSON to dataset in Spark - apache-spark

I am facing an issue and need your help. I have a task to convert a JSON file to a Dataset so that it can be loaded into Hive.
Code 1
SparkSession spark1 = SparkSession
    .builder()
    .appName("File_Validation")
    .config("spark.some.config.option", "some-value")
    .getOrCreate();
Dataset<Row> df = spark1.read().json("input/sample.json");
df.show();
The above code throws a NullPointerException.
I tried another way
Code 2
JavaRDD<String> jsonFile = context.textFile("input/sample.json");
Dataset<Row> df2 = spark1.read().json(jsonFile);
df2.show();
I created an RDD and passed it to spark1 (the SparkSession).
This Code 2 turns the JSON into a different format, with the header
+--------------------+
| _corrupt_record|
+--------------------+
with schema as - |-- _corrupt_record: string (nullable = true)
Please help in fixing it.
Sample JSON
{
    "user": "gT35Hhhre9m",
    "dates": ["2016-01-29", "2016-01-28"],
    "status": "OK",
    "reason": "some reason",
    "content": [{
        "foo": 123,
        "bar": "val1"
    }, {
        "foo": 456,
        "bar": "val2"
    }, {
        "foo": 789,
        "bar": "val3"
    }, {
        "foo": 124,
        "bar": "val4"
    }, {
        "foo": 126,
        "bar": "val5"
    }]
}

Your JSON should have one JSON object per line.
For example:
{ "property1": 1 }
{ "property1": 2 }
This will be read as a Dataset with 2 rows and one column.
From documentation:
Note that the file that is offered as a json file is not a typical
JSON file. Each line must contain a separate, self-contained valid
JSON object. As a consequence, a regular multi-line JSON file will
most often fail.
Of course, read the data with SparkSession, as it will infer the schema.

You can't read pretty-printed (formatted) JSON in Spark. Your JSON should be on a single line, like this:
{"user": "gT35Hhhre9m","dates": ["2016-01-29", "2016-01-28"],"status": "OK","reason": "some reason","content": [{"foo": 123,"bar": "val1"}, {"foo": 456,"bar": "val2"}, {"foo": 789,"bar": "val3"}, {"foo": 124,"bar": "val4"}, {"foo": 126,"bar": "val5"}]}
Or it could be multiple JSON objects, one per line, like this:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

Related

Work with decimal values after avro deserialization

I take Avro bytes from Kafka and deserialize them.
But I get strange output because of a decimal value, and I cannot work with it afterwards (for example, turn it into JSON or insert it into a DB):
import avro.schema, json
from avro.io import DatumReader, BinaryDecoder
# only needed part of schemaDict
schemaDict = {
    "name": "ApplicationEvent",
    "type": "record",
    "fields": [
        {
            "name": "desiredCreditLimit",
            "type": [
                "null",
                {
                    "type": "bytes",
                    "logicalType": "decimal",
                    "precision": 14,
                    "scale": 2
                }
            ],
            "default": None
        }
    ]
}
schema_avro = avro.schema.parse(json.dumps(schemaDict))
reader = DatumReader(schema_avro)
decoder = BinaryDecoder(data)  # data - binary data from Kafka
event_dict = reader.read(decoder)
print(event_dict)
# {'desiredCreditLimit': Decimal('100000.00')}
print(json.dumps(event_dict))
# TypeError: Object of type Decimal is not JSON serializable
I tried to use avro_json_serializer, but got the error: "AttributeError: 'decimal.Decimal' object has no attribute 'decode'".
And because of this Decimal in the dictionary I cannot insert the values into the DB either.
I also tried the fastavro library, but I could not deserialize the message, as I understand it because the serialization was not done with fastavro.

How to encode structs into Avro record in Spark?

I'm trying to use to_avro() function to create Avro records. However, I'm not able to encode multiple columns, as some columns are simply lost after encoding. A simple example to recreate the problem:
val schema = StructType(List(
  StructField("entity_type", StringType),
  StructField("entity", StringType)
))
val rdd = sc.parallelize(Seq(
  Row("PERSON", "John Doe")
))
val df = sqlContext.createDataFrame(rdd, schema)

df
  .withColumn("struct", struct(col("entity_type"), col("entity")))
  .select("struct")
  .collect()
  .foreach(println)
// prints [[PERSON, John Doe]]

df
  .withColumn("struct", struct(col("entity_type"), col("entity")))
  .select(to_avro(col("struct")).as("value"))
  .select(from_avro(col("value"), entitySchema).as("entity"))
  .collect()
  .foreach(println)
// prints [[, PERSON]]
My schema looks like this
{
  "type" : "record",
  "name" : "Entity",
  "fields" : [ {
    "name" : "entity_type",
    "type" : "string"
  },
  {
    "name" : "entity",
    "type" : "string"
  } ]
}
What's interesting is that if I change the column order in the struct, the result becomes [, John Doe].
I'm using Spark 2.4.5. According to Spark documentation: "to_avro() can be used to turn structs into Avro records. This method is particularly useful when you would like to re-encode multiple columns into a single one when writing data out to Kafka."
It works after changing the field types from "string" to ["string", "null"]. Not sure if this behavior is intended, though.
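For reference, a minimal sketch of what that change looks like, reusing df and the struct column from the snippet above; the nullable schema string below is an illustrative rewrite, and the spark-avro package is assumed to be on the classpath:

import org.apache.spark.sql.avro.{from_avro, to_avro}
import org.apache.spark.sql.functions.{col, struct}

// Illustrative schema with nullable fields, i.e. ["string", "null"] unions.
val entitySchema =
  """{
    |  "type": "record",
    |  "name": "Entity",
    |  "fields": [
    |    { "name": "entity_type", "type": ["string", "null"] },
    |    { "name": "entity", "type": ["string", "null"] }
    |  ]
    |}""".stripMargin

// Round-trip the struct through Avro; with the nullable unions both fields survive.
df.withColumn("struct", struct(col("entity_type"), col("entity")))
  .select(to_avro(col("struct")).as("value"))
  .select(from_avro(col("value"), entitySchema).as("entity"))
  .show(false)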

Spark can not process recursive avro data

I have an avsc schema like below:
{
  "name": "address",
  "type": [
    "null",
    {
      "type": "record",
      "name": "Address",
      "namespace": "com.data",
      "fields": [
        {
          "name": "address",
          "type": [ "null", "com.data.Address" ],
          "default": null
        }
      ]
    }
  ],
  "default": null
}
On loading this data in pyspark:
jsonFormatSchema = open("Address.avsc", "r").read()
spark = SparkSession.builder.appName('abc').getOrCreate()
df = spark.read.format("avro") \
    .option("avroSchema", jsonFormatSchema) \
    .load("xxx.avro")
I got this exception:
"Found recursive reference in Avro schema, which can not be processed by Spark"
I tried many other configurations, but without any success.
To execute it, I use spark-submit with:
--packages org.apache.spark:spark-avro_2.12:3.0.1
This is intended behaviour; you can take a look at the issue:
https://issues.apache.org/jira/browse/SPARK-25718

How can I use the Amazon example dataset in the Spark movie-recommendation example?

I want to use the Amazon data (metadata.json) in the Spark movie-recommendation example.
The movie-recommendation example uses the following format, but the Amazon data uses strings instead of integers.
Below is the source of the movie-recommendation example.
UserID::MovieID::Rating::Timestamp // ratings.dat format
MovieID::Title::Genres // movies.dat format
val ratings = sc.textFile(new File(movieLensHomeDir, "ratings.dat").toString).map { line =>
  val fields = line.split("::")
  // format: (timestamp % 10, Rating(userId, movieId, rating))
  (fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
}

val movies = sc.textFile(new File(movieLensHomeDir, "movies.dat").toString).map { line =>
  val fields = line.split("::")
  // format: (movieId, movieName)
  (fields(0).toInt, fields(1))
}.collect().toMap
[spark MLlib example - Movie Recommendation]
https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html
And this is the Amazon dataset
{
  "asin": "0000031852",
  "title": "Girls Ballet Tutu Zebra Hot Pink",
  "price": 3.17,
  "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg",
  "related": {
    "also_bought": ["B00JHONN1S", "B002BZX8Z6"],
    "also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W"],
    "bought_together": ["B002BZX8Z6"]
  },
  "salesRank": {"Toys & Games": 211836},
  "brand": "Coxlures",
  "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
}
{
...
}
[amazon review dataset - metadata]
############# SUMMARY ##############
I want to parse this JSON file and use it in the Spark example,
but I do not know how to change the string IDs (asin, title, ...) to unique integer IDs, or how to get results.
I tried parsing it with the SQL parser, but suddenly it does not work and I want to know another way.
At first this did not happen; could the error have suddenly occurred because the JSON file format is broken?
SQLContext_error.jpg
Spark does not work with typical JSON files. Spark expects JSON files in which each line is a complete JSON object. That's why regular multi-line JSON files fail and you get a [_corrupt_record: string] column. So you have to change the JSON a little bit so that Spark can read it. I modified your JSON file and it works without any issue:
{"asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "price": 3.17,"imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg","related": { "also_bought": ["B00JHONN1S", "B002BZX8Z6"], "also_viewed":["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W"], "bought_together": ["B002BZX8Z6"] }, "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]}
{"asin": "0000031853", "title": "AmazonTitle", "price": 5.20,"imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg","related": { "also_bought": ["B00JHONN1S", "B002BZX8Z6"], "also_viewed":["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W"], "bought_together": ["B002BZX8Z6"] }, "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]}
Following is the code & output.
val rdd = sqlContext.read.json("metadata.json")
rdd: org.apache.spark.sql.DataFrame = [asin: string, brand: string, categories: array<array<string>>, imUrl: string, price: double, related: struct<also_bought:array<string>,also_viewed:array<string>,bought_together:array<string>>, salesRank: struct<Toys & Games:bigint>, title: string]
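As for the other part of the question, mapping string IDs such as asin to unique integer IDs for Rating, here is a hedged sketch (not from the original answer) of one possible approach using zipWithUniqueId; the variable names are illustrative:

// Build a lookup table from asin (string) to a unique integer id.
val metadata = sqlContext.read.json("metadata.json")

val asinToId: Map[String, Int] = metadata
  .select("asin")
  .distinct()
  .rdd
  .map(_.getString(0))
  .zipWithUniqueId()        // (asin, uniqueId: Long)
  .mapValues(_.toInt)       // assumes fewer than Int.MaxValue distinct products
  .collectAsMap()
  .toMap

// asinToId("0000031852") can then be used as the integer item id when building Rating objects.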

DataFrame partitionBy on nested columns

I am trying to call partitionBy on a nested field like below:
val rawJson = sqlContext.read.json(filename)
rawJson.write.partitionBy("data.dataDetails.name").parquet(filenameParquet)
I get the error below when I run it. I do see 'name' listed as a field in the schema below. Is there a different format for specifying a nested column name?
java.lang.RuntimeException: Partition column data.dataDetails.name not found in schema StructType(StructField(name,StringType,true), StructField(time,StringType,true), StructField(data,StructType(StructField(dataDetails,StructType(StructField(name,StringType,true), StructField(id,StringType,true),true)),true))
This is my json file:
{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data": {
    "type": "EventData",
    "dataDetails": {
      "name": "EventName",
      "id": "1234"
    }
  }
}
This appears to be a known issue listed here: https://issues.apache.org/jira/browse/SPARK-18084
I had this issue as well and to work around it I was able to un-nest the columns on my dataset. My dataset was a little different than your dataset, but here is the strategy...
Original Json:
{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data": {
    "type": "EventData",
    "dataDetails": {
      "name": "EventName",
      "id": "1234"
    }
  }
}
Modified Json:
{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data_type": "EventData",
  "data_dataDetails_name": "EventName",
  "data_dataDetails_id": "1234"
}
Code to get to Modified Json:
def main(args: Array[String]) {
  ...
  val data = df.select(children("data", df) ++ Seq($"name", $"time"): _*)
  data.printSchema
  data.write.partitionBy("data_dataDetails_name").format("csv").save(...)
}

def children(colname: String, df: DataFrame) = {
  val parent = df.schema.fields.filter(_.name == colname).head
  val fields = parent.dataType match {
    case x: StructType => x.fields
    case _ => Array.empty[StructField]
  }
  fields.map(x => col(s"$colname.${x.name}").alias(s"${colname}_${x.name}"))
}
Since the feature is unavailable as of Spark 2.3.1, here's a workaround. Make sure to handle name conflicts between the nested fields and the fields at the root level (see the sketch after this example).
{"date":"20180808","value":{"group":"xxx","team":"yyy"}}
df.select("date", "value.group", "value.team")
  .write
  .partitionBy("date", "group", "team")
  .parquet(filenameParquet)
The partitions end up like
date=20180808/group=xxx/team=yyy/part-xxx.parquet
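For the name-conflict case in the original question, where both the root level and data.dataDetails contain a name field, here is a minimal sketch of the same workaround with the nested columns aliased before partitioning, reusing rawJson and filenameParquet from the question (the aliases are illustrative):

import org.apache.spark.sql.functions.col

rawJson
  .select(
    col("name"),
    col("time"),
    col("data.type").as("data_type"),
    col("data.dataDetails.name").as("data_dataDetails_name"),
    col("data.dataDetails.id").as("data_dataDetails_id"))
  .write
  .partitionBy("data_dataDetails_name")
  .parquet(filenameParquet)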
