pyspark transform json array into multiple columns - python-3.x

I'm using the code below to read data from an API whose payload is in JSON format, using PySpark in Azure Databricks. All the fields are defined as string, but I keep running into a "json_tuple requires that all arguments are strings" error.
Schema:
root
|-- Payload: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ActiveDate: string (nullable = true)
| | |-- BusinessId: string (nullable = true)
| | |-- BusinessName: string (nullable = true)
JSON:
{
"Payload":
[
{
"ActiveDate": "2008-11-25",
"BusinessId": "5678",
"BusinessName": "ACL"
},
{
"ActiveDate": "2009-03-22",
"BusinessId": "6789",
"BusinessName": "BCL"
}
]
}
PySpark:
from pyspark.sql import functions as F
df = df.select(F.col('Payload'), F.json_tuple(F.col('Payload'), 'ActiveDate', 'BusinessId', 'BusinessName') \
    .alias('ActiveDate', 'BusinessId', 'BusinessName'))
df.write.format("delta").mode("overwrite").saveAsTable("delta_payload")
Error:
AnalysisException: cannot resolve 'json_tuple(`Payload`, 'ActiveDate', 'BusinessId', 'BusinessName')' due to data type mismatch: json_tuple requires that all arguments are strings;

From your schema it looks like the JSON is already parsed, so Payload is of ArrayType rather than StringType containing JSON, hence the error.
You probably need explode instead of json_tuple:
>>> from pyspark.sql.functions import explode
>>> df = spark.createDataFrame([{
... "Payload":
... [
... {
... "ActiveDate": "2008-11-25",
... "BusinessId": "5678",
... "BusinessName": "ACL"
... },
... {
... "ActiveDate": "2009-03-22",
... "BusinessId": "6789",
... "BusinessName": "BCL"
... }
... ]
... }])
>>> df.schema
StructType(List(StructField(Payload,ArrayType(MapType(StringType,StringType,true),true),true)))
>>> df.select(explode("Payload").alias("x")).select("x.ActiveDate", "x.BusinessName", "x.BusinessId").show()
+----------+------------+----------+
|ActiveDate|BusinessName|BusinessId|
+----------+------------+----------+
|2008-11-25|         ACL|      5678|
|2009-03-22|         BCL|      6789|
+----------+------------+----------+
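If the end goal is still the Delta table from the question, a minimal sketch combining the explode approach with the original saveAsTable call (column and table names taken from the question) would be:
from pyspark.sql.functions import col, explode

flat_df = (df
    .select(explode(col("Payload")).alias("p"))
    .select("p.ActiveDate", "p.BusinessId", "p.BusinessName"))
flat_df.write.format("delta").mode("overwrite").saveAsTable("delta_payload")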

Related

PySpark Mark a Column Nullable: false

I have a structured stream reading from Kafka and am trying to convert the JSON payload using a Struct schema.
{
"fields": [
{
"metadata": {},
"name": "test",
"nullable": true,
"type": {
"containsNull": true,
"elementType": {
"fields": [
{
"metadata": {},
"name": "message",
"nullable": false,
"type": "string"
},
{
"metadata": {},
"name": "recipient_id",
"nullable": true,
"type": "long"
}
],
"type": "struct"
},
"type": "array"
}
},
{
"metadata": {},
"name": "user_id",
"nullable": true,
"type": "long"
}
],
"type": "struct"
}
Converting the JSON schema to a StructType results in the following.
StructType.fromJson(jsonSchema)
StructType([StructField('test', ArrayType(StructType([StructField('message', StringType(), False), StructField('recipient_id', LongType(), True)]), True), True), StructField('user_id', LongType(), True)])
Converting the payload using this schema results in a DataFrame schema where nullable is set to true, even though it is set to false in the schema above, and passing a null value to the field does not result in any errors.
spark_df = spark_df.selectExpr('timestamp', "CAST(value AS STRING)")
spark_df = spark_df.withColumn("value",from_json(col("value"),schemaNew, {"mode": "FAILFAST"}))
spark_df.printSchema()
root
|-- timestamp: timestamp (nullable = true)
|-- value: struct (nullable = true)
| |-- test: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- message: string (nullable = true)
| | | |-- recipient_id: long (nullable = true)
| |-- user_id: long (nullable = true)
How else can we read the schema from a file, apply the nullable property, and convert the JSON data to a proper DataFrame?
Since from_json() ignores nullability information (see the explanation below), you could try adding a filter() right after it that drops such data (where message is null). Or, if it's an option, you could use createDataFrame(), which does check nullability, and merge the results back.
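A minimal sketch of the filter() idea, assuming the parsed column is named value as in the printSchema output above (the exists() higher-order function needs Spark 2.4+):
from pyspark.sql.functions import expr

# from_json() will not enforce nullable=false, so explicitly drop rows where
# any element of value.test has a null message; rows with a null array are kept.
clean_df = spark_df.filter(expr(
    "value.test IS NULL OR NOT exists(value.test, x -> x.message IS NULL)"
))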
Why from_json() ignores nullability info
Prior to Spark 2.3.1, this was happening unexpectedly because the Jackson parsers produce null regardless of the nullability. There were two options to resolve this (from the SPARK issue):
1. Ignore nullability information in the schema and assume all fields are nullable, i.e. keep the same behaviour as before, given the limitation of the Jackson parsers.
2. Validate the data and fail during execution if the data contains nulls.
They went with option 1 because it's "the more performant option and a lot easier to do" and "less invasive" too. Hence, from_json() ignores nullability information set in the schema.

How can I LEFT JOIN two arrays of structs within a single row?

I'm working with "Action Breakdown" data extracted from Facebook's Ads Insights API
Facebook doesn't put the action (# of purchases) and the action_value ($ amount of purchase) in the same column, so I need to JOIN those on my end, based on the identifier of the action (id# + device type in my case).
If each action were simply its own row, it would of course be trivial to JOIN them with SQL. But in this case, I need to JOIN the two structs within each row. What I'm looking to do amounts to a LEFT JOIN across two structs, matched on two columns. Ideally I could do this with SQL alone (not PySpark/Scala/etc).
So far I have tried:
The SparkSQL inline generator. This gives me each action on its own row, but since the parent row in the original dataset doesn't have a unique identifier, there isn't a way to JOIN these structs on a per-row basis. Also tried using inline() on both columns, but only 1 "generator" function can be used at a time.
Using SparkSQL arrays_zip function to combine them. But this doesn't work because the order isn't always the same and they sometimes don't have the same keys.
I considered writing a map function in PySpark. But it seems map functions only identify columns by index and not name, which seems fragile if the columns should change later on (likely when working with 3rd party APIs).
I considered writing a PySpark UDF, which seems like the best option, but requires a permission I do not have (SELECT on anonymous function). If that's truly the best option, I'll try to push for that permission.
To better illustrate: Each row in my dataset has an actions and action_values column with data like this.
actions = [
{
"action_device": "desktop",
"action_type": "offsite_conversion.custom.123",
"value": "1"
},
{
"action_device": "desktop", /* Same conversion ID; different device. */
"action_type": "offsite_conversion.custom.321",
"value": "1"
},
{
"action_device": "iphone", /* Same conversion ID; different device. */
"action_type": "offsite_conversion.custom.321",
"value": "2"
},
{
"action_device": "iphone", /* has "actions" but not "action_values" */
"action_type": "offsite_conversion.custom.789",
"value": "1"
},
]
action_values = [
{
"action_device": "desktop",
"action_type": "offsite_conversion.custom.123",
"value": "49.99"
},
{
"action_device": "desktop",
"action_type": "offsite_conversion.custom.321",
"value": "19.99"
},
{
"action_device": "iphone",
"action_type": "offsite_conversion.custom.321",
"value": "99.99"
}
]
I would like each row to have both datapoints in a single struct, like this:
my_desired_result = [
{
"action_device": "desktop",
"action_type": "offsite_conversion.custom.123",
"count": "1", /* This comes from the "action" struct */
"value": "49.99" /* This comes from the "action_values" struct */
},
{
"action_device": "desktop",
"action_type": "offsite_conversion.custom.321",
"count": "1",
"value": "19.99"
},
{
"action_device": "iphone",
"action_type": "offsite_conversion.custom.321",
"count": "2",
"value": "99.99"
},
{
"action_device": "iphone",
"action_type": "offsite_conversion.custom.789",
"count": "1",
"value": null /* NULL because there is no value for conversion#789 AND iphone */
}
]
IIUC, you can try transform and then use filter to find the first matched item from action_values by matching action_device and action_type:
df.printSchema()
root
|-- action_values: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- action_device: string (nullable = true)
| | |-- action_type: string (nullable = true)
| | |-- value: string (nullable = true)
|-- actions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- action_device: string (nullable = true)
| | |-- action_type: string (nullable = true)
| | |-- value: string (nullable = true)
df.createOrReplaceTempView("df_table")
spark.sql("""
SELECT
transform(actions, x -> named_struct(
'action_device', x.action_device,
'action_type', x.action_type,
'count', x.value,
'value', filter(action_values, y -> y.action_device = x.action_device AND y.action_type = x.action_type)[0].value
)) as result
FROM df_table
""").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[desktop, offsite_conversion.custom.123, 1, 49.99], [desktop, offsite_conversion.custom.321, 1, 19.99], [iphone, offsite_conversion.custom.321, 2, 99.99], [iphone, offsite_conversion.custom.789, 1,]]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
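For reference, roughly the same LEFT JOIN logic can be expressed with the DataFrame API instead of SQL. This is only a sketch and assumes PySpark 3.1+, where F.transform and F.filter accept Python lambdas:
from pyspark.sql import functions as F

result = df.select(
    F.transform(
        "actions",
        lambda x: F.struct(
            x["action_device"].alias("action_device"),
            x["action_type"].alias("action_type"),
            x["value"].alias("count"),
            # first matching action_value, or null if there is none
            F.filter(
                "action_values",
                lambda y: (y["action_device"] == x["action_device"])
                          & (y["action_type"] == x["action_type"])
            )[0]["value"].alias("value")
        )
    ).alias("result")
)
result.show(truncate=False)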
UPDATE: in case of the FULL JOIN, you can try the following SQL:
spark.sql("""
SELECT
concat(
/* actions left join action_values with potentially multiple matched values */
flatten(
transform(actions, x ->
transform(
filter(action_values, y -> y.action_device = x.action_device AND y.action_type = x.action_type),
z -> named_struct(
'action_device', x.action_device,
'action_type', x.action_type,
'count', x.value,
'value', z.value
)
)
)
),
/* action_values missing from actions */
transform(
filter(action_values, x -> !exists(actions, y -> x.action_device = y.action_device AND x.action_type = y.action_type)),
z -> named_struct(
'action_device', z.action_device,
'action_type', z.action_type,
'count', NULL,
'value', z.value
)
)
) as result
FROM df_table
""").show(truncate=False)

from_json returns null values

I am trying to parse a string column (containing a JSON string) using from_json, but when I show my result DataFrame all the values are null. I am using string for every type, so there should not be any type-conversion problem, but the final result is still null.
I can show my originaldf and it displays the JSON string.
Sample JSON:
{"type": "mytype", "version": "0.2", "id": "dc771a5f-336e-4f65-be1c-79de1848d859"}
I am reading the JSON string from a file:
originaldf = spark.read.option("header", False).schema("message string").csv(myfilepath)
originaldf shows the following; it's not displaying the full value in the console (running in local mode):
root
|-- message: string (nullable = true)
{"fields":[{"metadata":{},"name":"message","nullable":true,"type":"string"}],"type":"struct"}
+-----------------+
| message|
+-----------------+
|{"type": "mytype"|
+-----------------+
schema passed to from_json
{
"fields":[
{
"metadata":{
},
"name":"id",
"nullable":true,
"type":"string"
},
{
"metadata":{
},
"name":"version",
"nullable":true,
"type":"string"
},
{
"metadata":{
},
"name":"type",
"nullable":true,
"type":"string"
}
],
"type":"struct"
}
newdf = originaldf.select(from_json("message",schema).alias("parsedjson")).select("parsedjson.*")
newdf.show() gives this output:
+----+-------+----+
|  id|version|type|
+----+-------+----+
|null|   null|null|
+----+-------+----+
This is strange. I have reproduced it and it worked. I used Spark 2.4.3.
from pyspark.sql import *
row = Row(message='''{"type": "mytype", "version": "0.2", "id": "dc771a5f-336e-4f65-be1c-79de1848d859"}''')
df = spark.createDataFrame([row])
>>> df.show()
+--------------------+
| message|
+--------------------+
|{"type": "mytype"...|
+--------------------+
>>> schema = '''
... {
... "fields":[
... {
... "metadata":{
...
... },
... "name":"id",
... "nullable":true,
... "type":"string"
... },
... {
... "metadata":{
...
... },
... "name":"version",
... "nullable":true,
... "type":"string"
... },
... {
... "metadata":{
...
... },
... "name":"type",
... "nullable":true,
... "type":"string"
... }
... ],
... "type":"struct"
... }
... '''
>>> from pyspark.sql.functions import *
>>> newdf = df.select(from_json("message",schema).alias("parsedjson")).select("parsedjson.*")
>>> newdf.show()
+--------------------+-------+------+
| id|version| type|
+--------------------+-------+------+
|dc771a5f-336e-4f6...| 0.2|mytype|
+--------------------+-------+------+
Thanks for your help. I was reading originaldf as .csv, due to which the data was not coming into the DataFrame as complete JSON (df.show was showing partial data, so it looked like the full value had been loaded, but df.col().first.getString(0) showed it was not the full JSON, only the string up to the first ',' because I was reading CSV). When I used a dummy UDF to return the JSON string, it worked. –
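Building on that comment: instead of a dummy UDF, one option is to read the file with the text reader so each line stays intact rather than being split on commas. A sketch, reusing the myfilepath and schema names from the question:
from pyspark.sql.functions import col, from_json

# spark.read.text() loads each whole line into a single 'value' column,
# so the JSON string is not split on commas the way the CSV reader splits it.
originaldf = spark.read.text(myfilepath).withColumnRenamed("value", "message")
newdf = originaldf.select(from_json(col("message"), schema).alias("parsedjson")).select("parsedjson.*")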

spark dataframe to Bigquery using simba driver

While trying to write a DataFrame to BigQuery using the Simba driver, I am getting the exception below. Here is the DataFrame schema; I have created a table in BigQuery with the same schema.
df.printSchema
root
|-- empid: integer (nullable = true)
|-- firstname: string (nullable = true)
|-- middle: string (nullable = true)
|-- last: string (nullable = true)
|-- gender: string (nullable = true)
|-- age: double (nullable = true)
|-- weight: integer (nullable = true)
|-- salary: integer (nullable = true)
|-- city: string (nullable = true)
The Simba driver is throwing the error below:
Caused by: com.simba.googlebigquery.support.exceptions.GeneralException: [Simba][BigQueryJDBCDriver](100032) Error executing query job. Message: 400 Bad Request
{
"code" : 400,
"errors" : [ {
"domain" : "global",
"location" : "q",
"locationType" : "parameter",
"message" : "Syntax error: Unexpected string literal \"empid\" at [1:38]",
"reason" : "invalidQuery"
} ],
"message" : "Syntax error: Unexpected string literal \"empid\" at [1:38]",
"status" : "INVALID_ARGUMENT"
}
... 24 more
Below is the code I am using:
val url = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=my_project_id;OAuthType=0;OAuthPvtKeyPath=service_account_jsonfile;OAuthServiceAcctEmail=googleaccount"
df.write.mode(SaveMode.Append).jdbc(url,"orders_dataset.employee",new java.util.Properties)
Please let me know if I am missing any other configuration or where I am going wrong.
Thanks in advance!
It seems that behavior is caused by Spark, which is adding extra quotes around column names.
To fix it, add the following code after creating the Spark context and before creating the DataFrame:
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

JdbcDialects.registerDialect(new JdbcDialect() {
  override def canHandle(url: String): Boolean = url.toLowerCase.startsWith("jdbc:bigquery:")
  // Return the column name as-is instead of wrapping it in quotes.
  override def quoteIdentifier(column: String): String = column
})
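With that dialect registered, the df.write.mode(SaveMode.Append).jdbc(...) call from the question should generate bare column names instead of double-quoted ones, which BigQuery's standard SQL otherwise parses as string literals (hence the Unexpected string literal "empid" error).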

How to extract complex JSON structures using Apache Spark 1.4.0 Data Frames

I am using the new Apache Spark version 1.4.0 Data-frames API to extract information from Twitter's Status JSON, mostly focused on the Entities Object - the relevant part to this question is showed below:
{
...
...
"entities": {
"hashtags": [],
"trends": [],
"urls": [],
"user_mentions": [
{
"screen_name": "linobocchini",
"name": "Lino Bocchini",
"id": 187356243,
"id_str": "187356243",
"indices": [ 3, 16 ]
},
{
"screen_name": "jeanwyllys_real",
"name": "Jean Wyllys",
"id": 111123176,
"id_str": "111123176",
"indices": [ 79, 95 ]
}
],
"symbols": []
},
...
...
}
There are several examples of how to extract information from primitive types such as string, integer, etc., but I couldn't find anything on how to process these kinds of complex structures.
I tried the code below, but it still doesn't work; it throws an exception.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val tweets = sqlContext.read.json("tweets.json")
// this function is just to filter empty entities.user_mentions[] nodes
// some tweets don't contain any mentions
import org.apache.spark.sql.functions.udf
val isEmpty = udf((value: List[Any]) => value.isEmpty)
import org.apache.spark.sql._
import sqlContext.implicits._
case class UserMention(id: Long, idStr: String, indices: Array[Long], name: String, screenName: String)
val mentions = tweets.select("entities.user_mentions").
filter(!isEmpty($"user_mentions")).
explode($"user_mentions") {
case Row(arr: Array[Row]) => arr.map { elem =>
UserMention(
elem.getAs[Long]("id"),
elem.getAs[String]("is_str"),
elem.getAs[Array[Long]]("indices"),
elem.getAs[String]("name"),
elem.getAs[String]("screen_name"))
}
}
mentions.first
Exception when I try to call mentions.first:
scala> mentions.first
15/06/23 22:15:06 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 8)
scala.MatchError: [List([187356243,187356243,List(3, 16),Lino Bocchini,linobocchini], [111123176,111123176,List(79, 95),Jean Wyllys,jeanwyllys_real])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:34)
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:34)
at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:55)
at org.apache.spark.sql.catalyst.expressions.UserDefinedGenerator.eval(generators.scala:81)
What is wrong here? I understand it is related to the types, but I couldn't figure it out yet.
As additional context, the structure mapped automatically is:
scala> mentions.printSchema
root
|-- user_mentions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- id_str: string (nullable = true)
| | |-- indices: array (nullable = true)
| | | |-- element: long (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- screen_name: string (nullable = true)
NOTE 1: I know it is possible to solve this using HiveQL, but I would like to use DataFrames since there is so much momentum around them.
SELECT explode(entities.user_mentions) as mentions
FROM tweets
NOTE 2: the UDF val isEmpty = udf((value: List[Any]) => value.isEmpty) is an ugly hack and I'm probably missing something here, but it was the only way I could come up with to avoid an NPE.
Here is a solution that works, with just one small hack.
The main idea is to work around the type problem by declaring a List[String] rather than List[Row]:
val mentions = tweets.explode("entities.user_mentions", "mention"){m: List[String] => m}
This creates a second column called "mention" of type "Struct":
+--------------------+--------------------+
|            entities|             mention|
+--------------------+--------------------+
|[List(),List(),Li...|[187356243,187356...|
|[List(),List(),Li...|[111123176,111123...|
+--------------------+--------------------+
Now do a map() to extract the fields inside mention. The getStruct(1) call gets the value in column 1 of each row:
case class Mention(id: Long, id_str: String, indices: Seq[Int], name: String, screen_name: String)
val mentionsRdd = mentions.map(
row =>
{
val mention = row.getStruct(1)
Mention(mention.getLong(0), mention.getString(1), mention.getSeq[Int](2), mention.getString(3), mention.getString(4))
}
)
And convert the RDD back into a DataFrame:
val mentionsDf = mentionsRdd.toDF()
There you go!
+---------+---------+------------+-------------+---------------+
|       id|   id_str|     indices|         name|    screen_name|
+---------+---------+------------+-------------+---------------+
|187356243|187356243| List(3, 16)|Lino Bocchini|   linobocchini|
|111123176|111123176|List(79, 95)|  Jean Wyllys|jeanwyllys_real|
+---------+---------+------------+-------------+---------------+
Spark hands the array back as a Scala Seq (a WrappedArray), not an Array, so the pattern Row(arr: Array[Row]) never matches and you get the MatchError. Try doing this instead:
case Row(arr: Seq[Row]) => arr.map { elem =>
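For completeness, a sketch of the question's explode with the pattern changed to Seq[Row]. Note that "is_str" in the question looks like a typo for "id_str", and indices also comes back as a Seq rather than an Array, so the case class is adjusted to match (untested on Spark 1.4.0):
// indices is returned as a Seq, not an Array, so the case class uses Seq[Long]
case class UserMention(id: Long, idStr: String, indices: Seq[Long], name: String, screenName: String)

val mentions = tweets.select("entities.user_mentions").
  filter(!isEmpty($"user_mentions")).
  explode($"user_mentions") {
    case Row(arr: Seq[Row]) => arr.map { elem =>
      UserMention(
        elem.getAs[Long]("id"),
        elem.getAs[String]("id_str"),
        elem.getAs[Seq[Long]]("indices"),
        elem.getAs[String]("name"),
        elem.getAs[String]("screen_name"))
    }
  }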
