How to read complex json array in pyspark df? - apache-spark

I have a json file that has below's structure that I need to read in as pyspark dataframe.
I tried reading in using multiLine option but it doesn't seem to return more data than the columns and datatypes. What am I possibly doing wrong and how can I read in belows'structure?
df = spark.read.format("json").option("inferSchema", "true") \
.option("multiLine", "true") \
.load("/mnt/blob/input/jsonfile.json")
Structure:
[
[
{
"Key": "Val"
},
{
"Key": "Val"
}
],
[
{
"Key": "Val"
},
{
"Key": "Val"
}
]
]

Related

Spark DataFrame change datatype based on JSON value

I'm constructing a dataframe from a JSON file and saving this dataframe to a parquet file. This parquet file is consumed by a PIG script for further processing.
Below is the schema of the JSON file:
{id:"1",
name:"test",
"fields": [
{
"fieldId": "ABC1.0",
"values": [
{
"key": "812320",
"formId": 11100,
"occ": 1,
"attachId": 0
}
]
},
{
"fieldId": "CDE2.0",
"values": [
{
"key": "MA",
"formId": 11100,
"occ": 1,
"attachId": 0
},
{
"key": 23.0,
"formId": 11100,
"occ": 1,
"attachId": 0
}
]
}
]
}
I need to set the data type of the "key" field based on its value. The value of the key could be string, long double, integer.
How Can I achieve this using spark dataframe/dataset.
First of all your content in the json file is wrong. All properties need to be enclosed in double quotes :-

Update the Nested Json with another Nested Json using Python

For example, I have one full set of nested JSON, I need to update this JSON with the latest values from another nested JSON.
Can anyone help me with this?
I want to implement this in Pyspark.
Full Set Json look like this:
{
"email": "abctest#xxx.com",
"firstName": "name01",
"id": 6304,
"surname": "Optional",
"layer01": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4",
"layer02": {
"key1": "value1",
"key2": "value2"
},
"layer03": [
{
"inner_key01": "inner value01"
},
{
"inner_key02": "inner_value02"
}
]
},
"surname": "Required only$uid"
}
LatestJson look like this:
{
"email": "test#xxx.com",
"firstName": "name01",
"surname": "Optional",
"id": 6304,
"layer01": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4",
"layer02": {
"key1": "value1_changedData",
"key2": "value2"
},
"layer03": [
{
"inner_key01": "inner value01"
},
{
"inner_key02": "inner_value02"
}
]
},
"surname": "Required only$uid"
}
In above for id=6304 we have received updates for the layer01.layer02.key1 and emailaddress fileds.
So I need to update these values to full JSON, Kindly help me with this.
You can load the 2 JSON files into Spark data frames and do a left_join to get updates from the latest JSON data :
from pyspark.sql import functions as F
full_json_df = spark.read.json(full_json_path, multiLine=True)
latest_json_df = spark.read.json(latest_json_path, multiLine=True)
updated_df = full_json_df.alias("full").join(
latest_json_df.alias("latest"),
F.col("full.id") == F.col("latest.id"),
"left"
).select(
F.col("full.id"),
*[
F.when(F.col("latest.id").isNotNull(), F.col(f"latest.{c}")).otherwise(F.col(f"full.{c}")).alias(c)
for c in full_json_df.columns if c != 'id'
]
)
updated_df.show(truncate=False)
#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
#|id |email |firstName|layer01 |surname |
#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
#|6304|test#xxx.com|name01 |[value1, value2, value3, value4, [value1_changedData, value2], [[inner value01,], [, inner_value02]]]|Optional|
#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
Update:
If the schema changes between full and latest JSONs, you can load the 2 files into the same data frame (this way the schemas are being merged) and then deduplicate per id:
from pyspark.sql import Window
from pyspark.sql import functions as F
merged_json_df = spark.read.json("/path/to/{full_json.json,latest_json.json}", multiLine=True)
# order priority: latest file then full
w = Window.partitionBy(F.col("id")).orderBy(F.when(F.input_file_name().like('%latest%'), 0).otherwise(1))
updated_df = merged_json_df.withColumn("rn", F.row_number().over(w))\
.filter("rn = 1")\
.drop("rn")
updated_df.show(truncate=False)

Spark can not process recursive avro data

I have avsc schema like below:
{
"name": "address",
"type": [
"null",
{
"type":"record",
"name":"Address",
"namespace":"com.data",
"fields":[
{
"name":"address",
"type":[ "null","com.data.Address"],
"default":null
}
]
}
],
"default": null
}
On loading this data in pyspark:
jsonFormatSchema = open("Address.avsc", "r").read()
spark = SparkSession.builder.appName('abc').getOrCreate()
df = spark.read.format("avro")\
.option("avroSchema", jsonFormatSchema)\
.load("xxx.avro")
I got such exception:
"Found recursive reference in Avro schema, which can not be processed by Spark"
I tried many other configurations, but without any success.
To execute I use with spark-submit:
--packages org.apache.spark:spark-avro_2.12:3.0.1
This is a intended feature, you can take a look at the "issue" :
https://issues.apache.org/jira/browse/SPARK-25718

Spark writing Parquet array<string> converts to a different datatype when loading into BigQuery

Spark Dataframe Schema:
StructType(
[StructField("a", StringType(), False),
StructField("b", StringType(), True),
StructField("c" , BinaryType(), False),
StructField("d", ArrayType(StringType(), False), True),
StructField("e", TimestampType(), True)
])
When I write the data frame to parquet and load it into BigQuery, it interprets the schema differently. It is a simple load from JSON and write to parquet using spark dataframe.
BigQuery Schema:
[
{
"type": "STRING",
"name": "a",
"mode": "REQUIRED"
},
{
"type": "STRING",
"name": "b",
"mode": "NULLABLE"
},
{
"type": "BYTES",
"name": "c",
"mode": "REQUIRED"
},
{
"fields": [
{
"fields": [
{
"type": "STRING",
"name": "element",
"mode": "NULLABLE"
}
],
"type": "RECORD",
"name": "list",
"mode": "REPEATED"
}
],
"type": "RECORD",
"name": "d",
"mode": "NULLABLE"
},
{
"type": "TIMESTAMP",
"name": "e",
"mode": "NULLABLE"
}
]
Is this something to do with the way spark writes or they way BigQuery reads parquet. Any idea how I can fix this?
This is due to the intermediate file format (parquet by default) that the spark-bigquery connector uses.
The connector first writes the data to parquet files, then loads them to BigQuery using BigQuery Insert API.
If you check the intermediate parquet schema using parquet-tools, you would find something like this the field d (ArrayType(StringType) in Spark)
optional group a (LIST) {
repeated group list {
optional binary element (STRING);
}
}
Now, if you were loading this parquet yourself in BigQuery using bq load or the BigQuery Insert API directly, you could be able to tell BQ to ignore the intermediate fields by enabling parquet_enable_list_inference
Unfortunately, I don't see how to enable this option when using the spark-bigquery connector!
As a workaround, you can try to use orc as the intermediate format.
df
.write
.format("bigquery")
.option("intermediateFormat", "orc")

DataFrame partitionBy on nested columns

I am trying to call partitionBy on a nested field like below:
val rawJson = sqlContext.read.json(filename)
rawJson.write.partitionBy("data.dataDetails.name").parquet(filenameParquet)
I get the below error when I run it. I do see the 'name' listed as the field in the below schema. Is there a different format to specify the column name which is nested?
java.lang.RuntimeException: Partition column data.dataDetails.name not found in schema StructType(StructField(name,StringType,true), StructField(time,StringType,true), StructField(data,StructType(StructField(dataDetails,StructType(StructField(name,StringType,true), StructField(id,StringType,true),true)),true))
This is my json file:
{
"name": "AssetName",
"time": "2016-06-20T11:57:19.4941368-04:00",
"data": {
"type": "EventData",
"dataDetails": {
"name": "EventName"
"id": "1234"
}
}
}
This appears to be a known issue listed here: https://issues.apache.org/jira/browse/SPARK-18084
I had this issue as well and to work around it I was able to un-nest the columns on my dataset. My dataset was a little different than your dataset, but here is the strategy...
Original Json:
{
"name": "AssetName",
"time": "2016-06-20T11:57:19.4941368-04:00",
"data": {
"type": "EventData",
"dataDetails": {
"name": "EventName"
"id": "1234"
}
}
}
Modified Json:
{
"name": "AssetName",
"time": "2016-06-20T11:57:19.4941368-04:00",
"data_type": "EventData",
"data_dataDetails_name" : "EventName",
"data_dataDetails_id": "1234"
}
}
Code to get to Modified Json:
def main(args: Array[String]) {
...
val data = df.select(children("data", df) ++ $"name" ++ $"time"): _*)
data.printSchema
data.write.partitionBy("data_dataDetails_name").format("csv").save(...)
}
def children(colname: String, df: DataFrame) = {
val parent = df.schema.fields.filter(_.name == colname).head
val fields = parent.dataType match {
case x: StructType => x.fields
case _ => Array.empty[StructField]
}
fields.map(x => col(s"$colname.${x.name}").alias(s"$colname" + s"_" + s"${x.name}"))
}
Since the feature is un-available as of Spark 2.3.1, here's a workaround. Make sure to handle name conflicts between the nested fields and the fields at the root level.
{"date":"20180808","value":{"group":"xxx","team":"yyy"}}
df.select("date","value.group","value.team")
.write
.partitionBy("date","group","team")
.parquet(filenameParquet)
The partitions end up like
date=20180808/group=xxx/team=yyy/part-xxx.parquet

Resources