pyspark: Converting string to struct - apache-spark

I have data as follows -
{
"Id": "01d3050e",
"Properties": "{\"choices\":null,\"object\":\"demo\",\"database\":\"pg\",\"timestamp\":\"1581534117303\"}",
"LastUpdated": 1581530000000,
"LastUpdatedBy": "System"
}
Using AWS Glue, I want to relationalize the "Properties" column, but that can't be done because its datatype is string. Converting it to a struct might make it possible, based on this blog post:
https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/
>>> df.show
<bound method DataFrame.show of DataFrame[Id: string, LastUpdated: bigint, LastUpdatedBy: string, Properties: string]>
>>> df.show()
+--------+-------------+-------------+--------------------+
| Id| LastUpdated|LastUpdatedBy| Properties|
+--------+-------------+-------------+--------------------+
|01d3050e|1581530000000| System|{"choices":null,"...|
+--------+-------------+-------------+--------------------+
How can I un-nest the "Properties" column to break it into "choices", "object", "database" and "timestamp" columns, using the Relationalize transform or a UDF in PySpark?

Use from_json, since the Properties column holds a JSON string.
If the schema is the same for all your records, you can convert the column to a struct type by defining the schema like this:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import MapType, StringType, StructField, StructType

schema = StructType([
    StructField("choices", StringType(), True),
    StructField("object", StringType(), True),
    StructField("database", StringType(), True),
    StructField("timestamp", StringType(), True),
])
df.withColumn("Properties", from_json(col("Properties"), schema)).show(truncate=False)
#+--------+-------------+-------------+---------------------------+
#|Id |LastUpdated |LastUpdatedBy|Properties |
#+--------+-------------+-------------+---------------------------+
#|01d3050e|1581530000000|System |[, demo, pg, 1581534117303]|
#+--------+-------------+-------------+---------------------------+
However, if the schema can change from one row to another, I'd suggest converting the column to a map type instead:
df.withColumn("Properties", from_json(col("Properties"), MapType(StringType(), StringType()))).show(truncate=False)
#+--------+-------------+-------------+------------------------------------------------------------------------+
#|Id |LastUpdated |LastUpdatedBy|Properties |
#+--------+-------------+-------------+------------------------------------------------------------------------+
#|01d3050e|1581530000000|System |[choices ->, object -> demo, database -> pg, timestamp -> 1581534117303]|
#+--------+-------------+-------------+------------------------------------------------------------------------+
You can then access elements of the map using element_at (Spark 2.4+), for example element_at(col("Properties"), "object").

Creating your dataframe:
from pyspark.sql import functions as F
# "data" rather than "list", so the built-in list type is not shadowed
data = [["01d3050e","{\"choices\":null,\"object\":\"demo\",\"database\":\"pg\",\"timestamp\":\"1581534117303\"}",1581530000000,"System"]]
df = spark.createDataFrame(data, ['Id','Properties','LastUpdated','LastUpdatedBy'])
df.show(truncate=False)
+--------+----------------------------------------------------------------------------+-------------+-------------+
|Id |Properties |LastUpdated |LastUpdatedBy|
+--------+----------------------------------------------------------------------------+-------------+-------------+
|01d3050e|{"choices":null,"object":"demo","database":"pg","timestamp":"1581534117303"}|1581530000000|System |
+--------+----------------------------------------------------------------------------+-------------+-------------+
Use the built-in regexp_replace, split, and element_at functions:
There is no need for a UDF here; the built-in functions are adequate and well optimized for large datasets.
# Strip the braces, turn ':' into ',', drop the double quotes, then split on ','
stripped = F.regexp_replace(F.regexp_replace(F.regexp_replace("Properties", r'\{|\}', ""), ':', ','), '"', "")
df.withColumn("Properties", F.split(stripped, ','))\
    .withColumn("choices", F.element_at("Properties", 2))\
    .withColumn("object", F.element_at("Properties", 4))\
    .withColumn("database", F.element_at("Properties", 6))\
    .withColumn("timestamp", F.element_at("Properties", 8).cast('long')).drop("Properties").show()
+--------+-------------+-------------+-------+------+--------+-------------+
| Id| LastUpdated|LastUpdatedBy|choices|object|database| timestamp|
+--------+-------------+-------------+-------+------+--------+-------------+
|01d3050e|1581530000000| System| null| demo| pg|1581534117303|
+--------+-------------+-------------+-------+------+--------+-------------+
root
|-- Id: string (nullable = true)
|-- LastUpdated: long (nullable = true)
|-- LastUpdatedBy: string (nullable = true)
|-- choices: string (nullable = true)
|-- object: string (nullable = true)
|-- database: string (nullable = true)
|-- timestamp: long (nullable = true)
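One caveat worth checking before relying on this approach: it treats every comma and colon as a delimiter, so it only works while no value contains those characters, and a JSON null comes back as the string "null" rather than a real null. The sketch below mimics the same replace-and-split steps in plain Python (standard library only, not Spark) to show both effects. The split_properties helper and the second sample record are made up for illustration.

```python
import json
import re


def split_properties(raw):
    """Mimic the regexp_replace/split chain above on a single JSON string."""
    stripped = re.sub(r"\{|\}", "", raw)   # drop the braces
    stripped = stripped.replace(":", ",")  # turn key/value separators into commas
    stripped = stripped.replace('"', "")   # drop the double quotes
    return stripped.split(",")


original = '{"choices":null,"object":"demo","database":"pg","timestamp":"1581534117303"}'
parts = split_properties(original)
# element_at is 1-based, so element_at(..., 4) corresponds to parts[3]
print(parts[1], parts[3], parts[5], parts[7])

# Hypothetical record whose "object" value contains a comma
tricky = '{"choices":null,"object":"demo,v2","database":"pg","timestamp":"1581534117303"}'
tricky_parts = split_properties(tricky)
print(tricky_parts[5])                 # no longer the database value
print(json.loads(tricky)["database"])  # a real JSON parser is unaffected
```

If values may contain commas, colons, or quotes, parsing with from_json as in the other answer avoids the positional bookkeeping entirely.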

Since I was using the AWS Glue service, I ended up using the "Unbox" class to unbox the string field in the DynamicFrame. It worked well for my use case.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Unbox.html
unbox = Unbox.apply(frame = dynamic_dframe, path = "Properties", format="json")

Related

Convert a column of JSON list to a DataFrame

I'm loading data over JDBC that includes a JSON object formatted as:
[
{
"numero": 1,
"resposta": "A",
"peso": 2
},
{
"numero": 2,
"resposta": "A",
"peso": 1
},
...
]
Its datatype is set to json (PostgreSQL), but when it is loaded into Spark, it comes with newline and tab characters.
I tried using the following schema, which results in null (I imagine that's because I have to iterate through the list, but I'm not sure how to do that):
schema = StructType(
    [
        StructField("peso", IntegerType(), False),
        StructField("numero", IntegerType(), False),
        StructField("resposta", StringType(), False)
    ]
)
questoes.withColumn("questoes", from_json("questoes", schema)).show(truncate=200)
Output:
Desired DataFrame:
numero | resposta | peso
1      | A        | 2
2      | A        | 1
...
Code used to read from the DB:
spark = SparkSession.builder.config(
    'spark.driver.extraClassPath', 'C:/Users/vitor/AppData/Roaming/DBeaverData/drivers/maven/maven-central/org.postgresql/postgresql-42.2.25.jar').getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
url = 'jdbc:postgresql://localhost:5432/informacoes_concursos'
properties = {'user': 'postgres', 'password': '123'}
gabaritos = spark.read.jdbc(url, table="gabaritos", properties=properties)
concursos = spark.read.jdbc(url, table="concursos", properties=properties)
Edit: I fixed the newline and tab characters by changing the dtype from json to jsonb.
There are two issues with your code:
1. Your JSON is not a struct with 3 fields; it is a collection of structs with 3 fields. You therefore need to change the schema and use an ArrayType.
2. Inside your database, the JSON data seems to be stored with tabs (\t) and newlines (\n), i.e. it is formatted. Spark's from_json function does not seem to be able to parse that, so we need to clean it.
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

# same as before, but wrapped within an array
schema = ArrayType(
    StructType([
        StructField("peso", IntegerType(), False),
        StructField("numero", IntegerType(), False),
        StructField("resposta", StringType(), False)
    ])
)
result = questoes\
    .withColumn("questoes", f.regexp_replace("questoes", "\\s", ""))\
    .withColumn("data", f.from_json("questoes", schema))
result.show(truncate=False)
which yields:
+---+---------------------------------------------------------------------------+----------------------+
|id |questoes |data |
+---+---------------------------------------------------------------------------+----------------------+
|1 |[{"numero":1,"resposta":"A","peso":2},{"numero":2,"resposta":"A","peso":1}]|[{2, 1, A}, {1, 2, A}]|
+---+---------------------------------------------------------------------------+----------------------+
and the schema:
result.printSchema()
root
|-- id: long (nullable = true)
|-- questoes: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- peso: integer (nullable = true)
| | |-- numero: integer (nullable = true)
| | |-- resposta: string (nullable = true)
You may drop the questoes column; I only kept it to display the cleansed JSON.
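A side note on the cleaning step: the \s pattern removes every whitespace character, including spaces inside string values, not only the tabs and newlines between tokens. The plain-Python sketch below (standard library only, not Spark; the sample value "A B" is invented for illustration) shows the effect on a formatted document, and that Python's own JSON parser accepts the formatted text without any cleaning.

```python
import json
import re

# Formatted JSON as it might come back from a json column; "A B" is a made-up value
formatted = '[\n\t{\n\t\t"numero": 1,\n\t\t"resposta": "A B",\n\t\t"peso": 2\n\t}\n]'

# Same pattern as the regexp_replace call above: delete all whitespace
cleaned = re.sub(r"\s", "", formatted)
print(cleaned)

# The space inside the string value is gone as well
print(json.loads(cleaned)[0]["resposta"])

# Whitespace between tokens is legal JSON, so parsing the original keeps the value intact
print(json.loads(formatted)[0]["resposta"])
```

If the values can contain spaces, a narrower pattern such as [\t\n\r] is probably a safer choice, since raw tabs and line breaks cannot appear inside a valid JSON string.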

How to get StructType object out of a StructType in spark java?

I'm working on a Spark Java application and want to access a StructType object nested inside another StructType. For instance, the schema of a Spark DataFrame looks something like this:
root
|-- name: struct (nullable = true)
| |-- firstname: string (nullable = true)
| |-- middlename: string (nullable = true)
| |-- lastname: string (nullable = true)
|-- language: string (nullable = true)
|-- fee: integer (nullable = true)
I want to grab name as a StructType so that I can analyze it further; it would make a sort of chain. The problem is that at the root level, or at any level, we can only extract a StructField out of a StructType, not another StructType.
StructType st = df.schema(); // root-level StructType
st.fields(); // array of StructFields, but if I take name as a StructField I lose the fields inside it; name is a StructType and I want to keep it as such
// StructType name = <extracted from st>; this is what I want to achieve
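You can use the attributes and methods mentioned in the official documentation. The nested StructType is the dataType of the name field (the example below is in PySpark; in Java the corresponding accessor is dataType(), whose result can be cast to StructType):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField('name', StructType([StructField('firstname', StringType()), StructField('middlename', StringType()), StructField('lastname', StringType())])),
    StructField('language', StringType()),
    StructField('fee', IntegerType())])
for f in schema.fields:
    if f.name == "name":
        print(f.dataType)  # the nested StructType
        for f2 in f.dataType.fields:
            print(f2.name)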
[Out]:
StructType([StructField('firstname', StringType(), True), StructField('middlename', StringType(), True), StructField('lastname', StringType(), True)])
firstname
middlename
lastname
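To follow the whole chain of nested structs rather than a single known field, a recursive walk is one option. The sketch below works on the JSON form of a schema (the dictionary shape PySpark returns from schema.jsonValue()), so it runs without a Spark session. The nested_structs and make_field helpers and the hand-written schema dictionary are illustrative assumptions, not part of Spark's API.

```python
def nested_structs(schema, prefix=""):
    """Yield (dotted_path, struct_dict) for every struct nested in a schema dict."""
    for field in schema["fields"]:
        field_type = field["type"]
        # Atomic types are plain strings such as "string"; structs are dicts
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            path = prefix + field["name"]
            yield path, field_type
            # Recurse so that structs inside structs are found as well
            yield from nested_structs(field_type, path + ".")


def make_field(name, type_):
    """Build one field entry in the same shape as the schema's JSON form."""
    return {"name": name, "type": type_, "nullable": True, "metadata": {}}


schema = {
    "type": "struct",
    "fields": [
        make_field("name", {
            "type": "struct",
            "fields": [make_field("firstname", "string"),
                       make_field("middlename", "string"),
                       make_field("lastname", "string")],
        }),
        make_field("language", "string"),
        make_field("fee", "integer"),
    ],
}

for path, struct in nested_structs(schema):
    print(path, [f["name"] for f in struct["fields"]])
```

The same recursion can be written against the StructType objects themselves by testing whether a field's dataType is a StructType and descending into its fields.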

How can I copy the nullability state of a source Spark Dataframe schema and force it onto a target Spark Dataframe?

I'm using Databricks. Let's say I have two Spark Dataframes (I'm using PySpark):
df_source
df_target
If df_source has the following schema:
root
|-- name: string (nullable = false)
|-- id: long (nullable = false)
|-- age: long (nullable = true)
And df_target has the following schema:
root
|-- name: string (nullable = true)
|-- id: long (nullable = false)
|-- age: long (nullable = false)
How do I efficiently create another Dataframe, df_final where the (nullable = true/false) property from the df_source can be forced onto df_target?
I have tried doing the following:
df_final = spark.createDataFrame(df_target.rdd, schema = df_source.schema)
By this method, I'm able to achieve the desired result but it seems to be taking a long amount of time for the dataset size that I have. For smaller datasets, it works fine. Using the collect() function instead of an rdd conversion is obviously way worse for larger datasets.
I would like to point out that the only thing I want to do here is copy the nullability part from the source schema and change it accordingly in target, for the final dataframe.
Is there a way to do some sort of nullability casting, which works similar to .withColumn() performance wise, without RDD conversion, without explicit column name specification in the code? The column ordering is already aligned between source and target.
Additional context: The reason I need to do this is because I need to write (append) df_final to a Google BigQuery table using the Spark BQ connector. So, even if my Spark Dataframe doesn't have any null values in a column but the nullability property is set to true, the BigQuery table will reject the write operation since that column in the BigQuery table may have the nullable property set to false, and the schema mismatches.
Since you know for a fact that age can't be null, you can coalesce age with a constant literal to create a non-nullable field. For fields whose nullability has to be converted from false to true, a when expression can be used.
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql import functions as F
df_source_schema = StructType([
StructField("name", StringType(), False),
StructField("id", LongType(), False),
StructField("age", LongType(), True),
])
df_target_schema = StructType([
StructField("name", StringType(), True),
StructField("id", LongType(), False),
StructField("age", LongType(), False),
])
df_source = spark.createDataFrame([("a", 1, 18, ), ], df_source_schema)
df_source.printSchema()
"""
root
|-- name: string (nullable = false)
|-- id: long (nullable = false)
|-- age: long (nullable = true)
"""
df_target = spark.createDataFrame([("a", 1, 18), ], df_target_schema)
df_target.printSchema()
"""
root
|-- name: string (nullable = true)
|-- id: long (nullable = false)
|-- age: long (nullable = false)
"""
# Construct the selection expression based on the logic described above
target_field_nullable_map = {field.name: field.nullable for field in df_target.schema}
selection_expr = []
for src_field in df_source.schema:
    field_name = src_field.name
    field_type = src_field.dataType
    if target_field_nullable_map[field_name] != src_field.nullable:
        if src_field.nullable:
            selection_expr.append(F.when(F.col(field_name).isNotNull(), F.col(field_name)).otherwise(F.lit(None)).alias(field_name))
        else:
            selection_expr.append(F.coalesce(F.col(field_name), F.lit("-1").cast(field_type)).alias(field_name))
    else:
        selection_expr.append(F.col(field_name))
df_final = df_target.select(*selection_expr)
df_final.printSchema()
"""
root
|-- name: string (nullable = false)
|-- id: long (nullable = false)
|-- age: long (nullable = true)
"""
Why does this work?
A coalesce expression evaluates to null only if all of its child expressions are null, so it is nullable only when every child is nullable.
Since lit is a non-null expression when value != null, coalesce results in a non-nullable column.
A when expression is nullable if any branch is nullable or if the else expression is nullable.

How to concatenate nested json in Apache Spark

Can someone let me know where I'm going wrong with my attempt to concatenate a nested JSON field?
I'm using the following code:
df = (df
    .withColumn("ingestion_date", current_timestamp())
    .withColumn("name", concat(col("name.forename"),
                               lit(" "), col("name.surname")))
)
Schema:
root
|-- driverRef: string (nullable = true)
|-- number: integer (nullable = true)
|-- code: string (nullable = true)
|-- forename: string (nullable = true)
|-- surname: string (nullable = true)
|-- dob: date (nullable = true)
As you can see, I'm trying to concatenate forename and surname so as to provide a full name in the name field.
After concatenation, the name field should hold one single value, e.g. it would just show Lewis Hamilton, and likewise for the other values in the name field.
My code produces the following error:
Can't extract value from name#6976: need struct type but got string
It would seem that you have a dataframe that contains a name column containing a json with two values: forename and surname, just like this {"forename": "Lewis", "surname" : "Hamilton"}.
That column, in Spark, has a string type, which explains the error you get. You could only use name.forename if name were of type struct with a field called forename. That's what Spark means by "need struct type but got string".
You just need to tell spark that this string column is a JSON and how to parse it.
from pyspark.sql.types import StructType, StringType, StructField
from pyspark.sql import functions as f
# initializing data
df = spark.range(1).withColumn('name',
f.lit('{"forename": "Lewis", "surname" : "Hamilton"}'))
df.show(truncate=False)
+---+---------------------------------------------+
|id |name |
+---+---------------------------------------------+
|0 |{"forename": "Lewis", "surname" : "Hamilton"}|
+---+---------------------------------------------+
And parsing that JSON:
json_schema = StructType([
StructField('forename', StringType()),
StructField('surname', StringType())
])
df\
.withColumn('s', f.from_json(f.col('name'), json_schema))\
.withColumn("name", f.concat_ws(" ", f.col("s.forename"), f.col("s.surname")))\
.show()
+---+--------------+-----------------+
| id| name| s|
+---+--------------+-----------------+
| 0|Lewis Hamilton|{Lewis, Hamilton}|
+---+--------------+-----------------+
You may then get rid of s with drop; it only holds the parsed struct.
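Note that the answer uses concat_ws where the question used concat. The two differ in how they treat nulls: in Spark, concat returns null as soon as any input is null, while concat_ws skips null inputs. The plain-Python emulation below shows the difference without needing Spark. The functions concat_like and concat_ws_like are hypothetical stand-ins, with None playing the role of null.

```python
def concat_like(*values):
    """Emulate Spark's concat: the result is null (None) if any input is null."""
    if any(v is None for v in values):
        return None
    return "".join(values)


def concat_ws_like(sep, *values):
    """Emulate Spark's concat_ws: null inputs are skipped, the rest joined by sep."""
    return sep.join(v for v in values if v is not None)


print(concat_like("Lewis", " ", "Hamilton"))
print(concat_ws_like(" ", "Lewis", "Hamilton"))

# A driver with a missing surname
print(concat_like("Lewis", " ", None))
print(concat_ws_like(" ", "Lewis", None))
```

With concat_ws, a row with a missing forename or surname still gets a usable name instead of a null.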

How to parse Nested Json string from DynamoDB table in spark? [duplicate]

I have a Cassandra table that for simplicity looks something like:
key: text
jsonData: text
blobData: blob
I can create a basic data frame for this using spark and the spark-cassandra-connector using:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
I'm struggling though to expand the JSON data into its underlying structure. I ultimately want to be able to filter based on the attributes within the json string and return the blob data. Something like jsonData.foo = "bar" and return blobData. Is this currently possible?
Spark >= 2.4
If needed, schema can be determined using schema_of_json function (please note that this assumes that an arbitrary row is a valid representative of the schema).
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json}
import collection.JavaConverters._
val schema = schema_of_json(lit(df.select($"jsonData").as[String].first))
df.withColumn("jsonData", from_json($"jsonData", schema, Map[String, String]().asJava))
Spark >= 2.1
You can use from_json function:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val schema = StructType(Seq(
StructField("k", StringType, true), StructField("v", DoubleType, true)
))
df.withColumn("jsonData", from_json($"jsonData", schema))
Spark >= 1.6
You can use get_json_object which takes a column and a path:
import org.apache.spark.sql.functions.get_json_object
val exprs = Seq("k", "v").map(
c => get_json_object($"jsonData", s"$$.$c").alias(c))
df.select($"*" +: exprs: _*)
and extracts fields to individual strings, which can then be cast to the expected types.
The path argument is expressed using dot syntax, with leading $. denoting document root (since the code above uses string interpolation $ has to be escaped, hence $$.).
Spark <= 1.5:
Is this currently possible?
As far as I know it is not directly possible. You can try something similar to this:
val df = sc.parallelize(Seq(
("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
("2", """{"k": "bar", "v": 3.0}""", "some_other_field_2")
)).toDF("key", "jsonData", "blobData")
I assume that the blob field cannot be represented in JSON. Otherwise you can omit the splitting and joining:
import org.apache.spark.sql.Row
val blobs = df.drop("jsonData").withColumnRenamed("key", "bkey")
val jsons = sqlContext.read.json(df.drop("blobData").map{
case Row(key: String, json: String) =>
s"""{"key": "$key", "jsonData": $json}"""
})
val parsed = jsons.join(blobs, $"key" === $"bkey").drop("bkey")
parsed.printSchema
// root
// |-- jsonData: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: double (nullable = true)
// |-- key: long (nullable = true)
// |-- blobData: string (nullable = true)
An alternative (cheaper, although more complex) approach is to use a UDF to parse the JSON and output a struct or map column. For example, something like this:
import net.liftweb.json.parse
case class KV(k: String, v: Int)
val parseJson = udf((s: String) => {
implicit val formats = net.liftweb.json.DefaultFormats
parse(s).extract[KV]
})
val parsed = df.withColumn("parsedJSON", parseJson($"jsonData"))
parsed.show
// +---+--------------------+------------------+----------+
// |key| jsonData| blobData|parsedJSON|
// +---+--------------------+------------------+----------+
// | 1|{"k": "foo", "v":...|some_other_field_1| [foo,1]|
// | 2|{"k": "bar", "v":...|some_other_field_2| [bar,3]|
// +---+--------------------+------------------+----------+
parsed.printSchema
// root
// |-- key: string (nullable = true)
// |-- jsonData: string (nullable = true)
// |-- blobData: string (nullable = true)
// |-- parsedJSON: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: integer (nullable = false)
zero323's answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():
import org.apache.spark.sql.functions.from_json
val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
df.withColumn("jsonData", from_json($"jsonData", json_schema))
Here's the Python equivalent:
from pyspark.sql.functions import from_json
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df.withColumn("jsonData", from_json("jsonData", json_schema))
The problem with schema_of_json(), as zero323 points out, is that it inspects a single string and derives a schema from that. If you have JSON data with varied schemas, then the schema you get back from schema_of_json() will not reflect what you would get if you were to merge the schemas of all the JSON data in your DataFrame. Parsing that data with from_json() will then yield a lot of null or empty values where the schema returned by schema_of_json() doesn't match the data.
By using Spark's ability to derive a comprehensive JSON schema from an RDD of JSON strings, we can guarantee that all the JSON data can be parsed.
Example: schema_of_json() vs. spark.read.json()
Here's an example (in Python, the code is very similar for Scala) to illustrate the difference between deriving the schema from a single element with schema_of_json() and deriving it from all the data using spark.read.json().
>>> df = spark.createDataFrame(
... [
... (1, '{"a": true}'),
... (2, '{"a": "hello"}'),
... (3, '{"b": 22}'),
... ],
... schema=['id', 'jsonData'],
... )
a has a boolean value in one row and a string value in another. The merged schema for a would set its type to string. b would be an integer.
Let's see how the different approaches compare. First, the schema_of_json() approach:
>>> json_schema = schema_of_json(df.select("jsonData").take(1)[0][0])
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: boolean (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true]|
| 2| null|
| 3| []|
+---+--------+
As you can see, the JSON schema we derived was very limited. "a": "hello" couldn't be parsed as a boolean and returned null, and "b": 22 was just dropped because it wasn't in our schema.
Now with spark.read.json():
>>> json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: long (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true,]|
| 2|[hello,]|
| 3| [, 22]|
+---+--------+
Here we have all our data preserved, and with a comprehensive schema that accounts for all the data. "a": true was cast as a string to match the schema of "a": "hello".
The main downside of using spark.read.json() is that Spark will scan through all your data to derive the schema. Depending on how much data you have, that overhead could be significant. If you know that all your JSON data has a consistent schema, it's fine to go ahead and just use schema_of_json() against a single element. If you have schema variability but don't want to scan through all your data, you can set samplingRatio to something less than 1.0 in your call to spark.read.json() to look at a subset of the data.
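The difference between the two approaches can be sketched in a few lines of plain Python. This is a deliberately simplified model, not Spark's actual inference code: it handles only flat objects with boolean, integer, and string values, and it assumes that conflicting types fall back to string, which is what the example above shows for field a. The names infer_one and merge_schemas are made up for this sketch.

```python
import json


def infer_one(raw):
    """Infer a {field: type_name} schema from a single flat JSON object."""
    schema = {}
    for key, value in json.loads(raw).items():
        if isinstance(value, bool):   # check bool first: bool is a subclass of int
            schema[key] = "boolean"
        elif isinstance(value, int):
            schema[key] = "long"
        else:
            schema[key] = "string"
    return schema


def merge_schemas(schemas):
    """Union the fields of all schemas; conflicting types fall back to string."""
    merged = {}
    for schema in schemas:
        for key, type_name in schema.items():
            if key in merged and merged[key] != type_name:
                merged[key] = "string"
            else:
                merged[key] = type_name
    return merged


rows = ['{"a": true}', '{"a": "hello"}', '{"b": 22}']

single = infer_one(rows[0])                         # like inspecting one row
merged = merge_schemas(infer_one(r) for r in rows)  # like scanning every row

print(single)
print(merged)
```

The single-row schema knows nothing about b and assumes a is always boolean, which is why the first approach produces nulls and dropped fields, while the merged schema covers every row at the cost of reading all of them.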
Here are the docs for spark.read.json(): Scala API / Python API
The from_json function is exactly what you're looking for. Your code will look something like:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
//You can define whatever struct type that your json states
val schema = StructType(Seq(
StructField("key", StringType, true),
StructField("value", DoubleType, true)
))
df.withColumn("jsonData", from_json(col("jsonData"), schema))
The underlying JSON string is:
"{ \"column_name1\":\"value1\",\"column_name2\":\"value2\",\"column_name3\":\"value3\",\"column_name5\":\"value5\"}";
Below is the script to filter the JSON and load the required data into Cassandra:
sqlContext.read.json(rdd).select("column_name1 or fields name in Json", "column_name2","column_name2")
.write.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "Table_name", "keyspace" -> "Key_Space_name"))
.mode(SaveMode.Append)
.save()
I use the following (available since Spark 2.2.0; I am assuming that your JSON string column is at column index 0):
def parse(df: DataFrame, spark: SparkSession): DataFrame = {
  val stringDf = df.map((value: Row) => value.getString(0), Encoders.STRING)
  spark.read.json(stringDf)
}
It will automatically infer the schema in your JSON. Documented here:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameReader.html
