How to parse a nested JSON string from a DynamoDB table in Spark? [duplicate] - apache-spark

I have a Cassandra table that for simplicity looks something like:
key: text
jsonData: text
blobData: blob
I can create a basic data frame for this using spark and the spark-cassandra-connector using:
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "mytable", "keyspace" -> "ks1"))
  .load()
I'm struggling though to expand the JSON data into its underlying structure. I ultimately want to be able to filter based on the attributes within the json string and return the blob data. Something like jsonData.foo = "bar" and return blobData. Is this currently possible?

Spark >= 2.4
If needed, the schema can be determined using the schema_of_json function (note that this assumes an arbitrary row is a valid representative of the schema).
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json}
import collection.JavaConverters._
val schema = schema_of_json(lit(df.select($"jsonData").as[String].first))
df.withColumn("jsonData", from_json($"jsonData", schema, Map[String, String]().asJava))
Spark >= 2.1
You can use from_json function:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("k", StringType, true), StructField("v", DoubleType, true)
))
df.withColumn("jsonData", from_json($"jsonData", schema))
Spark >= 1.6
You can use get_json_object which takes a column and a path:
import org.apache.spark.sql.functions.get_json_object
val exprs = Seq("k", "v").map(
c => get_json_object($"jsonData", s"$$.$c").alias(c))
df.select($"*" +: exprs: _*)
and extracts the fields as individual strings, which can be further cast to the expected types.
The path argument is expressed using dot syntax, with leading $. denoting document root (since the code above uses string interpolation $ has to be escaped, hence $$.).
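For reference, a PySpark sketch of the same extract-and-cast step, assuming the same k/v fields as above (this is only an illustrative equivalent, not part of the original answer):
from pyspark.sql.functions import get_json_object

# extract each field as a string, then cast to the expected type
df.select(
    "*",
    get_json_object("jsonData", "$.k").alias("k"),
    get_json_object("jsonData", "$.v").cast("double").alias("v"),
)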
Spark <= 1.5:
Is this currently possible?
As far as I know it is not directly possible. You can try something similar to this:
val df = sc.parallelize(Seq(
  ("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
  ("2", """{"k": "bar", "v": 3.0}""", "some_other_field_2")
)).toDF("key", "jsonData", "blobData")
I assume that the blob field cannot be represented in JSON. Otherwise you can omit the splitting and joining:
import org.apache.spark.sql.Row
val blobs = df.drop("jsonData").withColumnRenamed("key", "bkey")
val jsons = sqlContext.read.json(df.drop("blobData").map {
  case Row(key: String, json: String) =>
    s"""{"key": "$key", "jsonData": $json}"""
})
val parsed = jsons.join(blobs, $"key" === $"bkey").drop("bkey")
parsed.printSchema
// root
// |-- jsonData: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: double (nullable = true)
// |-- key: long (nullable = true)
// |-- blobData: string (nullable = true)
An alternative (cheaper, although more complex) approach is to use a UDF to parse the JSON and output a struct or map column. For example, something like this:
import net.liftweb.json.parse
case class KV(k: String, v: Int)
val parseJson = udf((s: String) => {
  implicit val formats = net.liftweb.json.DefaultFormats
  parse(s).extract[KV]
})
val parsed = df.withColumn("parsedJSON", parseJson($"jsonData"))
parsed.show
// +---+--------------------+------------------+----------+
// |key| jsonData| blobData|parsedJSON|
// +---+--------------------+------------------+----------+
// | 1|{"k": "foo", "v":...|some_other_field_1| [foo,1]|
// | 2|{"k": "bar", "v":...|some_other_field_2| [bar,3]|
// +---+--------------------+------------------+----------+
parsed.printSchema
// root
// |-- key: string (nullable = true)
// |-- jsonData: string (nullable = true)
// |-- blobData: string (nullable = true)
// |-- parsedJSON: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: integer (nullable = false)

zero323's answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():
import org.apache.spark.sql.functions.from_json
val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
df.withColumn("jsonData", from_json($"jsonData", json_schema))
Here's the Python equivalent:
from pyspark.sql.functions import from_json
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df.withColumn("jsonData", from_json("jsonData", json_schema))
The problem with schema_of_json(), as zero323 points out, is that it inspects a single string and derives a schema from that. If you have JSON data with varied schemas, then the schema you get back from schema_of_json() will not reflect what you would get if you were to merge the schemas of all the JSON data in your DataFrame. Parsing that data with from_json() will then yield a lot of null or empty values where the schema returned by schema_of_json() doesn't match the data.
By using Spark's ability to derive a comprehensive JSON schema from an RDD of JSON strings, we can guarantee that all the JSON data can be parsed.
Example: schema_of_json() vs. spark.read.json()
Here's an example (in Python, the code is very similar for Scala) to illustrate the difference between deriving the schema from a single element with schema_of_json() and deriving it from all the data using spark.read.json().
>>> df = spark.createDataFrame(
... [
... (1, '{"a": true}'),
... (2, '{"a": "hello"}'),
... (3, '{"b": 22}'),
... ],
... schema=['id', 'jsonData'],
... )
a has a boolean value in one row and a string value in another. The merged schema for a would set its type to string, and b would be a long.
Let's see how the different approaches compare. First, the schema_of_json() approach:
>>> json_schema = schema_of_json(df.select("jsonData").take(1)[0][0])
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: boolean (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true]|
| 2| null|
| 3| []|
+---+--------+
As you can see, the JSON schema we derived was very limited. "a": "hello" couldn't be parsed as a boolean and returned null, and "b": 22 was just dropped because it wasn't in our schema.
Now with spark.read.json():
>>> json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: long (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true,]|
| 2|[hello,]|
| 3| [, 22]|
+---+--------+
Here we have all our data preserved, and with a comprehensive schema that accounts for all the data. "a": true was cast as a string to match the schema of "a": "hello".
The main downside of using spark.read.json() is that Spark will scan through all your data to derive the schema. Depending on how much data you have, that overhead could be significant. If you know that all your JSON data has a consistent schema, it's fine to go ahead and just use schema_of_json() against a single element. If you have schema variability but don't want to scan through all your data, you can set samplingRatio to something less than 1.0 in your call to spark.read.json() to look at a subset of the data.
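As a rough sketch of that sampling option (the 0.1 ratio here is only an illustrative choice; samplingRatio is a standard option of the JSON reader):
# infer the schema from roughly 10% of the JSON strings instead of scanning all of them
json_schema = (
    spark.read
    .option("samplingRatio", 0.1)
    .json(df.select("jsonData").rdd.map(lambda x: x[0]))
    .schema
)
df.withColumn("jsonData", from_json("jsonData", json_schema))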
Here are the docs for spark.read.json(): Scala API / Python API

The from_json function is exactly what you're looking for. Your code will look something like:
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "mytable", "keyspace" -> "ks1"))
  .load()

// You can define whatever struct type your JSON dictates
val schema = StructType(Seq(
  StructField("key", StringType, true),
  StructField("value", DoubleType, true)
))

df.withColumn("jsonData", from_json(col("jsonData"), schema))
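Once jsonData is parsed into a struct, the filter-and-project step from the question becomes plain column access. A PySpark sketch of that last step, reusing the field names assumed in the schema above ("bar" is just the question's example value):
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

schema = StructType([
    StructField("key", StringType(), True),
    StructField("value", DoubleType(), True),
])

parsed = df.withColumn("jsonData", from_json(col("jsonData"), schema))
# filter on a nested attribute and return only the blob column, as the question asks
parsed.filter(col("jsonData.key") == "bar").select("blobData")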

The underlying JSON string is:
"{ \"column_name1\":\"value1\",\"column_name2\":\"value2\",\"column_name3\":\"value3\",\"column_name5\":\"value5\"}";
Below is the script to filter the JSON and load the required data into Cassandra.
sqlContext.read.json(rdd).select("column_name1", "column_name2", "column_name3") // or whichever field names appear in your JSON
.write.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "Table_name", "keyspace" -> "Key_Space_name"))
.mode(SaveMode.Append)
.save()

I use the following
(available since 2.2.0, and assuming that your JSON string column is at column index 0):
def parse(df: DataFrame, spark: SparkSession): DataFrame = {
  // take the JSON string column (index 0) as a Dataset[String] and let Spark infer the schema
  val stringDf = df.map(_.getString(0))(Encoders.STRING)
  spark.read.json(stringDf)
}
It will automatically infer the schema of your JSON. Documented here:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameReader.html
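For reference, a rough PySpark equivalent of this helper might look like the following (again assuming, as above, that the JSON string sits in the first column):
def parse(df, spark):
    # turn the first column into an RDD of JSON strings and let Spark infer the schema
    return spark.read.json(df.rdd.map(lambda row: row[0]))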

Related

How to concatenate nested json in Apache Spark

Can someone let me know where I'm going wrong with my attempt to concatenate a nested JSON field?
I'm using the following code:
df = (df
      .withColumn("ingestion_date", current_timestamp())
      .withColumn("name", concat(col("name.forename"), lit(" "), col("name.surname")))
)
Schema:
root
|-- driverRef: string (nullable = true)
|-- number: integer (nullable = true)
|-- code: string (nullable = true)
|-- forename: string (nullable = true)
|-- surname: string (nullable = true)
|-- dob: date (nullable = true)
As you can see, I'm trying to concatenate forename & surname so as to provide a full name in the name field.
After concatenating, the name field should show a single value, e.g. Lewis Hamilton, and likewise for the other rows.
My code produces the following error:
Can't extract value from name#6976: need struct type but got string
It would seem that you have a dataframe containing a name column that holds a JSON string with two values, forename and surname, like this: {"forename": "Lewis", "surname" : "Hamilton"}.
That column, in Spark, has string type, which explains the error you obtain. You can only do name.forename if name is of struct type with a field called forename. That is what Spark means by need struct type but got string.
You just need to tell Spark that this string column contains JSON and how to parse it.
from pyspark.sql.types import StructType, StringType, StructField
from pyspark.sql import functions as f
# initializing data
df = spark.range(1).withColumn('name',
f.lit('{"forename": "Lewis", "surname" : "Hamilton"}'))
df.show(truncate=False)
+---+---------------------------------------------+
|id |name |
+---+---------------------------------------------+
|0 |{"forename": "Lewis", "surname" : "Hamilton"}|
+---+---------------------------------------------+
And parsing that JSON:
json_schema = StructType([
    StructField('forename', StringType()),
    StructField('surname', StringType())
])

df\
    .withColumn('s', f.from_json(f.col('name'), json_schema))\
    .withColumn("name", f.concat_ws(" ", f.col("s.forename"), f.col("s.surname")))\
    .show()
.show()
+---+--------------+-----------------+
| id| name| s|
+---+--------------+-----------------+
| 0|Lewis Hamilton|{Lewis, Hamilton}|
+---+--------------+-----------------+
You may then get rid of s with drop; it contains the parsed struct.
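For instance, the full chain with that cleanup step might look like this (same f and json_schema as above; df_with_full_name is just an illustrative name):
df_with_full_name = (
    df
    .withColumn("s", f.from_json(f.col("name"), json_schema))
    .withColumn("name", f.concat_ws(" ", f.col("s.forename"), f.col("s.surname")))
    .drop("s")  # the helper struct is no longer needed
)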

How to modify a dataframe in-place so that its ArrayType column can't be null (nullable = false and containsNull = false)?

Take the following example dataframe:
val df = Seq(Seq("xxx")).toDF("a")
Schema:
root
|-- a: array (nullable = true)
| |-- element: string (containsNull = true)
How can I modify df in-place so that the resulting dataframe is not nullable anywhere, i.e. has the following schema:
root
|-- a: array (nullable = false)
| |-- element: string (containsNull = false)
I understand that I can re-create another dataframe enforcing a non-nullable schema, as in Change nullable property of column in spark dataframe:
spark.createDataFrame(df.rdd, StructType(StructField("a", ArrayType(StringType, false), false) :: Nil))
But this is not an option under structured streaming, so I want it to be some kind of in-place modification.
So the way to achieve this is with a UserDefinedFunction
// Problem setup
val df = Seq(Seq("xxx")).toDF("a")
df.printSchema
root
|-- a: array (nullable = true)
| |-- element: string (containsNull = true)
Onto the solution:
import org.apache.spark.sql.types.{ArrayType, StringType}
import org.apache.spark.sql.functions.{udf, col}
// We define a sub schema with the appropriate data type and null condition
val subSchema = ArrayType(StringType, containsNull = false)
// We create a UDF that applies this sub schema
// while specifying the output of the UDF to be non-nullable
val applyNonNullableSchemaUdf = udf((x:Seq[String]) => x, subSchema).asNonNullable
// We apply the UDF
val newSchemaDF = df.withColumn("a", applyNonNullableSchemaUdf(col("a")))
And there you have it.
// Check new schema
newSchemaDF.printSchema
root
|-- a: array (nullable = false)
| |-- element: string (containsNull = false)
// Check that it actually works
newSchemaDF.show
+-----+
| a|
+-----+
|[xxx]|
+-----+

Spark SQL CSV to JSON with different data types

Currently, I have a csv data like this:
id,key,value
id_1,int_key,1
id_1,string_key,asd
id_1,double_key,null
id_2,double_key,2.0
I'd like to transform these attributes, grouped by their id and with their correct corresponding data types, into JSON.
I'm expecting to have a json structure like this:
[{
  "id": "id_1",
  "attributes": {
    "int_key": 1,
    "string_key": "asd",
    "double_key": null
  }
}, {
  "id": "id_2",
  "attributes": {
    "double_key": 2.0
  }
}]
My current solution is to use collect_list with to_json in Spark, which looks like this:
SELECT id, to_json(map_from_arrays(collect_list(key), collect_list(value))) AS attributes
FROM my_table
GROUP BY id
This works; however, I cannot find a way to cast the values to their correct data types, and I get the following instead:
[{
  "id": "id_1",
  "attributes": {
    "int_key": "1",
    "string_key": "asd",
    "double_key": "null"
  }
}, {
  "id": "id_2",
  "attributes": {
    "double_key": "2.0"
  }
}]
I also need to support null values, but I already found a solution for that: I use the ignoreNulls option in to_json. The remaining issue is that if I enumerate each attribute and cast it to its corresponding type, I end up including every attribute that is defined; I only want to include the attributes that actually appear for each id in the CSV file.
By the way, I'm using Spark 2.4.
Python: Here is my PySpark version of the conversion from the Scala version. The results are the same.
from pyspark.sql.functions import col, max, struct
df = spark.read.option("header","true").csv("test.csv")
keys = [row.key for row in df.select(col("key")).distinct().collect()]
df2 = df.groupBy("id").pivot("key").agg(max("value"))
df2.show()
df2.printSchema()
for key in keys:
    df2 = df2.withColumn(key, col(key).cast(key.split('_')[0]))
df2.show()
df2.printSchema()
df3 = df2.select("id", struct("int_key", "double_key", "string_key").alias("attributes"))
jsonArray = df3.toJSON().collect()
for json in jsonArray: print(json)
Scala: I tried to split each type of value by using the pivot first.
val keys = df.select('key).distinct.rdd.map(r => r(0).toString).collect
val df2 = df.groupBy('id).pivot('key, keys).agg(max('value))
df2.show
df2.printSchema
Then, the DataFrame looks like below:
+----+-------+----------+----------+
| id|int_key|double_key|string_key|
+----+-------+----------+----------+
|id_2| null| 2.0| null|
|id_1| 1| null| asd|
+----+-------+----------+----------+
root
|-- id: string (nullable = true)
|-- int_key: string (nullable = true)
|-- double_key: string (nullable = true)
|-- string_key: string (nullable = true)
where the type of each column is still strings.
To cast it, I have used the foldLeft,
val df3 = keys.foldLeft(df2) { (df, key) => df.withColumn(key, col(key).cast(key.split("_").head)) }
df3.show
df3.printSchema
and the result now has the correct types.
+----+-------+----------+----------+
| id|int_key|double_key|string_key|
+----+-------+----------+----------+
|id_2| null| 2.0| null|
|id_1| 1| null| asd|
+----+-------+----------+----------+
root
|-- id: string (nullable = true)
|-- int_key: integer (nullable = true)
|-- double_key: double (nullable = true)
|-- string_key: string (nullable = true)
Then, you can build your json such as
val df4 = df3.select('id, struct('int_key, 'double_key, 'string_key) as "attributes")
val jsonArray = df4.toJSON.collect
jsonArray.foreach(println)
where the last line prints the result:
{"id":"id_2","attributes":{"double_key":2.0}}
{"id":"id_1","attributes":{"int_key":1,"string_key":"asd"}}

Spark - convert array of JSON Strings to Struct array, filter and concat with root

I am totally new to Spark and I'm writing a pipeline to perform some transformations on a list of audits.
Example of my data:
{
  "id": 932522712299,
  "ticket_id": 12,
  "created_at": "2020-02-14T19:05:16Z",
  "author_id": 392401450482,
  "events": ["{\"id\": 932522713292, \"type\": \"VoiceComment\", \"public\": false, \"data\": {\"from\": \"11987654321\", \"to\": \"+1987644\"}}"]
}
My schema is basically:
root
|-- id: long (nullable = true)
|-- ticket_id: long (nullable = true)
|-- created_at: string (nullable = true)
|-- author_id: long (nullable = true)
|-- events: array (nullable = true)
| |-- element: string (containsNull = true)
My transformation has a few steps:
Split events by type: comments, tags, change or update;
For each event found, I must add ticket_id, author_id and created_at from root;
It must have one output for each event type.
Basically, each object inside the events array is a JSON string, because each type has a different structure; the only attribute they have in common is type.
I have reached my goal by doing some terrible work: converting my dataframe to dicts using the following code:
audits = list(map(lambda row: row.asDict(), df.collect()))
comments = []
for audit in audits:
    base_info = {'ticket_id': audit['ticket_id'], 'created_at': audit['created_at'], 'author_id': audit['author_id']}
    audit['events'] = [json.loads(x) for x in audit['events']]
    audit_comments = [
        {**x, **base_info}
        for x in audit['events']
        if x['type'] == "Comment" or x['type'] == "VoiceComment"
    ]
    comments.extend(audit_comments)
Maybe this question sounds lame or lazy, but I'm really stuck on simple things like:
how to parse the 'events' items into structs?
how to select events by type and add information from the root? Maybe using select syntax?
Any help is appreciated.
Since the events array elements don't have the same structure for all rows, what you can do is convert them to a Map(String, String).
Using the from_json function with the schema MapType(StringType(), StringType()):
df = df.withColumn("events", explode("events"))\
.withColumn("events", from_json(col("events"), MapType(StringType(), StringType())))
Then, using element_at (Spark 2.4+), you can get the type like this:
df = df.withColumn("event_type", element_at(col("events"), "type"))
df.printSchema()
#root
#|-- author_id: long (nullable = true)
#|-- created_at: string (nullable = true)
#|-- events: map (nullable = true)
#| |-- key: string
#| |-- value: string (valueContainsNull = true)
#|-- id: long (nullable = true)
#|-- ticket_id: long (nullable = true)
#|-- event_type: string (nullable = true)
Now, you can filter and select as normal columns:
df.filter(col("event_type") == lit("VoiceComment")).show(truncate=False)
#+------------+--------------------+-----------------------------------------------------------------------------------------------------------+------------+---------+------------+
#|author_id |created_at |events |id |ticket_id|event_type |
#+------------+--------------------+-----------------------------------------------------------------------------------------------------------+------------+---------+------------+
#|392401450482|2020-02-14T19:05:16Z|[id -> 932522713292, type -> VoiceComment, public -> false, data -> {"from":"11987654321","to":"+1987644"}]|932522712299|12 |VoiceComment|
#+------------+--------------------+-----------------------------------------------------------------------------------------------------------+------------+---------+------------+
Your code collects the complete events data onto the driver that submitted the job. The Spark way to process data is to build a distributed job: there are multiple APIs for this, and they create a DAG plan for the job which is only executed when you call an action such as head or show.
A job like this will be distributed to all machines in the cluster.
When working with the DataFrame API, a lot can be done with pyspark.sql.functions.
Below are the same transformations using the Spark SQL DataFrame API:
import pyspark.sql.functions as F
df = df.withColumn('event', F.explode(df.events)).drop(df.events)
df = df.withColumn('event', F.from_json(df.event, 'STRUCT <id: INT, type: STRING, public: Boolean, data: STRUCT<from: STRING, to: STRING>>'))
events = df.where('event.type = "Comment" OR event.type == "VoiceComment"')
events.printSchema()
events.head(100)
When data cannot be processed with SQL expressions, you can implement a plain user-defined function (UDF) or a Pandas UDF.
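As a rough sketch of that UDF route (parse_event and the map-of-strings return type are assumptions for illustration, not part of the original answer):
import json

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

# parse one event JSON string into a map of top-level keys to string values
# (nested objects are simply stringified in this sketch)
@F.udf(returnType=MapType(StringType(), StringType()))
def parse_event(event_str):
    if event_str is None:
        return None
    return {k: str(v) for k, v in json.loads(event_str).items()}

# applied to the exploded string column, i.e. as an alternative to the from_json call above
events_as_map = df.withColumn("event", parse_event(F.col("event")))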

pyspark: Converting string to struct

I have data as follows -
{
"Id": "01d3050e",
"Properties": "{\"choices\":null,\"object\":\"demo\",\"database\":\"pg\",\"timestamp\":\"1581534117303\"}",
"LastUpdated": 1581530000000,
"LastUpdatedBy": "System"
}
Using AWS Glue, I want to relationalize the "Properties" column, but since its datatype is string, it can't be done. Converting it to a struct might do it, based on this blog post:
https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/
>>> df.show
<bound method DataFrame.show of DataFrame[Id: string, LastUpdated: bigint, LastUpdatedBy: string, Properties: string]>
>>> df.show()
+--------+-------------+-------------+--------------------+
| Id| LastUpdated|LastUpdatedBy| Properties|
+--------+-------------+-------------+--------------------+
|01d3050e|1581530000000| System|{"choices":null,"...|
+--------+-------------+-------------+--------------------+
How can I un-nest the "Properties" column to break it into "choices", "object", "database" and "timestamp" columns, using the Relationalize transformer or any UDF in PySpark?
Use from_json since the column Properties is a JSON string.
If the schema is the same for all your records, you can convert it to a struct type by defining the schema like this:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import MapType, StringType, StructField, StructType

schema = StructType([
    StructField("choices", StringType(), True),
    StructField("object", StringType(), True),
    StructField("database", StringType(), True),
    StructField("timestamp", StringType(), True)
])

df.withColumn("Properties", from_json(col("Properties"), schema)).show(truncate=False)
#+--------+-------------+-------------+---------------------------+
#|Id |LastUpdated |LastUpdatedBy|Properties |
#+--------+-------------+-------------+---------------------------+
#|01d3050e|1581530000000|System |[, demo, pg, 1581534117303]|
#+--------+-------------+-------------+---------------------------+
However, if the schema can change from one row to another, I'd suggest you convert it to a Map type instead:
df.withColumn("Properties", from_json(col("Properties"), MapType(StringType(), StringType()))).show(truncate=False)
#+--------+-------------+-------------+------------------------------------------------------------------------+
#|Id |LastUpdated |LastUpdatedBy|Properties |
#+--------+-------------+-------------+------------------------------------------------------------------------+
#|01d3050e|1581530000000|System |[choices ->, object -> demo, database -> pg, timestamp -> 1581534117303]|
#+--------+-------------+-------------+------------------------------------------------------------------------+
You can then access elements of the map using element_at (Spark 2.4+)
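For example (a sketch using the column names from the sample above; the object and timestamp keys are taken from the Properties string):
from pyspark.sql.functions import col, element_at, from_json
from pyspark.sql.types import MapType, StringType

props = df.withColumn("Properties", from_json(col("Properties"), MapType(StringType(), StringType())))

# pull individual keys out of the map column
props.select(
    "Id",
    element_at(col("Properties"), "object").alias("object"),
    element_at(col("Properties"), "timestamp").cast("long").alias("timestamp"),
).show()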
Creating your dataframe:
from pyspark.sql import functions as F
list=[["01d3050e","{\"choices\":null,\"object\":\"demo\",\"database\":\"pg\",\"timestamp\":\"1581534117303\"}",1581530000000,"System"]]
df=spark.createDataFrame(list, ['Id','Properties','LastUpdated','LastUpdatedBy'])
df.show(truncate=False)
+--------+----------------------------------------------------------------------------+-------------+-------------+
|Id |Properties |LastUpdated |LastUpdatedBy|
+--------+----------------------------------------------------------------------------+-------------+-------------+
|01d3050e|{"choices":null,"object":"demo","database":"pg","timestamp":"1581534117303"}|1581530000000|System |
+--------+----------------------------------------------------------------------------+-------------+-------------+
Use inbuilt regex, split, and element_at:
No need to use a UDF; the built-in functions are adequate and well optimized for big data tasks.
df.withColumn("Properties", F.split(F.regexp_replace(F.regexp_replace((F.regexp_replace("Properties",'\{|}',"")),'\:',','),'\"|"',"").cast("string"),','))\
.withColumn("choices", F.element_at("Properties",2))\
.withColumn("object", F.element_at("Properties",4))\
.withColumn("database",F.element_at("Properties",6))\
.withColumn("timestamp",F.element_at("Properties",8).cast('long')).drop("Properties").show()
+--------+-------------+-------------+-------+------+--------+-------------+
| Id| LastUpdated|LastUpdatedBy|choices|object|database| timestamp|
+--------+-------------+-------------+-------+------+--------+-------------+
|01d3050e|1581530000000| System| null| demo| pg|1581534117303|
+--------+-------------+-------------+-------+------+--------+-------------+
root
|-- Id: string (nullable = true)
|-- LastUpdated: long (nullable = true)
|-- LastUpdatedBy: string (nullable = true)
|-- choices: string (nullable = true)
|-- object: string (nullable = true)
|-- database: string (nullable = true)
|-- timestamp: long (nullable = true)
Since I was using the AWS Glue service, I ended up using the "Unbox" class to unbox the string field in the DynamicFrame. It worked well for my use case.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Unbox.html
unbox = Unbox.apply(frame = dynamic_dframe, path = "Properties", format="json")
