from_json returns null values - apache-spark

I am trying to parse a string column (containing a JSON string) using from_json, but when I show my result DataFrame it displays every value as null. I am using string for every field type, so there should not be any type conversion problem, yet the final result is still null.
I can show my originaldf and it displays the JSON string.
Sample JSON:
{"type": "mytype", "version": "0.2", "id": "dc771a5f-336e-4f65-be1c-79de1848d859"}
I am reading the JSON string from a file:
originaldf = spark.read.option("header", False).schema("message STRING").csv(myfilepath)
originaldf shows the following. It is not displaying the full value in the console (running in local mode):
root
|-- message: string (nullable = true)
{"fields":[{"metadata":{},"name":"message","nullable":true,"type":"string"}],"type":"struct"}
+-----------------+
| message|
+-----------------+
|{"type": "mytype"|
+-----------------+
Schema passed to from_json:
{
  "fields": [
    {
      "metadata": {},
      "name": "id",
      "nullable": true,
      "type": "string"
    },
    {
      "metadata": {},
      "name": "version",
      "nullable": true,
      "type": "string"
    },
    {
      "metadata": {},
      "name": "type",
      "nullable": true,
      "type": "string"
    }
  ],
  "type": "struct"
}
newdf = originaldf.select(from_json("message",schema).alias("parsedjson")).select("parsedjson.*")
newdf.show() gives this output:
+----+-------+----+
|  id|version|type|
+----+-------+----+
|null|   null|null|
+----+-------+----+

This is strange. I tried to reproduce it and it worked for me, using Spark 2.4.3.
from pyspark.sql import *
row = Row(message='''{"type": "mytype", "version": "0.2", "id": "dc771a5f-336e-4f65-be1c-79de1848d859"}''')
df = spark.createDataFrame([row])
>>> df.show()
+--------------------+
| message|
+--------------------+
|{"type": "mytype"...|
+--------------------+
>>> schema = '''
... {
... "fields":[
... {
... "metadata":{
...
... },
... "name":"id",
... "nullable":true,
... "type":"string"
... },
... {
... "metadata":{
...
... },
... "name":"version",
... "nullable":true,
... "type":"string"
... },
... {
... "metadata":{
...
... },
... "name":"type",
... "nullable":true,
... "type":"string"
... }
... ],
... "type":"struct"
... }
... '''
>>> from pyspark.sql.functions import *
>>> newdf = df.select(from_json("message",schema).alias("parsedjson")).select("parsedjson.*")
>>> newdf.show()
+--------------------+-------+------+
| id|version| type|
+--------------------+-------+------+
|dc771a5f-336e-4f6...| 0.2|mytype|
+--------------------+-------+------+

Thanks for your help. I was reading originaldf as CSV, which is why the data was not loaded into the DataFrame as complete JSON. df.show() displayed partial output, so it looked as if the full value had been loaded, but df.col().first().getString(0) showed it was not the full JSON, only the string up to the first ',', since I was reading it as CSV. When I used a dummy UDF to return the JSON string instead, it worked.
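For anyone hitting the same issue, here is a minimal sketch of an alternative to the dummy-UDF workaround, assuming the file contains one JSON document per line; myfilepath and schema refer to the same variables as in the question. Reading with spark.read.text() loads each line as a single string column, so commas inside the JSON are not split the way the CSV reader splits them.
from pyspark.sql.functions import from_json

# text() reads each line whole into a single column named "value".
originaldf = spark.read.text(myfilepath).withColumnRenamed("value", "message")

# Now the full JSON string reaches from_json and parses correctly.
newdf = originaldf.select(from_json("message", schema).alias("parsedjson")).select("parsedjson.*")
newdf.show(truncate=False)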

Related

How to (dynamically) join array with a struct, to get a value from the struct for each element in the array?

I am trying to parse/flatten JSON data containing an array and a struct.
For every "Id" in the "data_array" column, I need to get the "EstValue" from the "data_struct" column. The field name in "data_struct" is the actual id (from "data_array"). I tried my best to use a dynamic join, but I am getting the error "Column is not iterable". Can't we use dynamic join conditions in PySpark, like we can in SQL? Is there a better way of achieving this?
JSON Input file:
{
  "data_array": [
    {
      "id": 1,
      "name": "ABC"
    },
    {
      "id": 2,
      "name": "DEF"
    }
  ],
  "data_struct": {
    "1": {
      "estimated": { "value": 123 },
      "completed": { "value": 1234 }
    },
    "2": {
      "estimated": { "value": 456 },
      "completed": { "value": 4567 }
    }
  }
}
Desired output:
Id Name EstValue CompValue
1 ABC 123 1234
2 DEF 456 4567
My PySpark code:
from pyspark.sql.functions import *
rawDF = spark.read.json([f"abfss://{pADLSContainer}#{pADLSGen2}.dfs.core.windows.net/{pADLSDirectory}/InputFile.json"], multiLine = "true")
idDF = rawDF.select(explode("data_array").alias("data_array")) \
.select(col("data_array.id").alias("id"))
idDF.show(n=2,vertical=True,truncate=150)
finalDF = idDF.join(rawDF, (idDF.id == rawDF.select(col("data_struct." + idDF.Id))) )
finalDF.show(n=2,vertical=True,truncate=150)
Error:
def __iter__(self): raise TypeError("Column is not iterable")
Self-joins create problems. In this case, you can avoid the join.
You could make arrays from both columns, zip them together and use inline to extract them into columns. The most difficult part is creating an array from the "data_struct" column. Maybe there's a better way, but I could only think of first transforming it into a map type.
Input:
s = """
{
"data_array": [
{
"id": 1,
"name": "ABC"
},
{
"id": 2,
"name": "DEF"
}
],
"data_struct": {
"1": {
"estimated": {
"value": 123
},
"completed": {
"value": 1234
}
},
"2": {
"estimated": {
"value": 456
},
"completed": {
"value": 4567
}
}
}
}
"""
rawDF = spark.read.json(sc.parallelize([s]), multiLine = "true")
Script:
from pyspark.sql import functions as F

id = F.transform('data_array', lambda x: x.id).alias('Id')
name = F.transform('data_array', lambda x: x['name']).alias('Name')
map = F.from_json(F.to_json("data_struct"), 'map<string, struct<estimated:struct<value:long>,completed:struct<value:long>>>')
est_val = F.transform(id, lambda x: map[x].estimated.value).alias('EstValue')
comp_val = F.transform(id, lambda x: map[x].completed.value).alias('CompValue')
df = rawDF.withColumn('y', F.arrays_zip(id, name, est_val, comp_val))
df = df.selectExpr("inline(y)")
df.show()
# +---+----+--------+---------+
# | Id|Name|EstValue|CompValue|
# +---+----+--------+---------+
# | 1| ABC| 123| 1234|
# | 2| DEF| 456| 4567|
# +---+----+--------+---------+

pyspark transform json array into multiple columns

I'm using the code below to read data from an API where the payload is in JSON format, using PySpark in Azure Databricks. All the fields are defined as strings, but I keep running into the error "json_tuple requires that all arguments are strings".
Schema:
root
|-- Payload: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ActiveDate: string (nullable = true)
| | |-- BusinessId: string (nullable = true)
| | |-- BusinessName: string (nullable = true)
JSON:
{
  "Payload": [
    {
      "ActiveDate": "2008-11-25",
      "BusinessId": "5678",
      "BusinessName": "ACL"
    },
    {
      "ActiveDate": "2009-03-22",
      "BusinessId": "6789",
      "BusinessName": "BCL"
    }
  ]
}
PySpark:
from pyspark.sql import functions as F
df = df.select(F.col('Payload'),
               F.json_tuple(F.col('Payload'), 'ActiveDate', 'BusinessId', 'BusinessName')
                .alias('ActiveDate', 'BusinessId', 'BusinessName'))
df.write.format("delta").mode("overwrite").saveAsTable("delta_payload")
Error:
AnalysisException: cannot resolve 'json_tuple(`Payload`, 'ActiveDate', 'BusinessId', 'BusinessName')' due to data type mismatch: json_tuple requires that all arguments are strings;
From your schema it looks like the JSON is already parsed, so Payload is of ArrayType rather than StringType containing JSON, hence the error.
You probably need explode instead of json_tuple:
>>> from pyspark.sql.functions import explode
>>> df = spark.createDataFrame([{
... "Payload":
... [
... {
... "ActiveDate": "2008-11-25",
... "BusinessId": "5678",
... "BusinessName": "ACL"
... },
... {
... "ActiveDate": "2009-03-22",
... "BusinessId": "6789",
... "BusinessName": "BCL"
... }
... ]
... }])
>>> df.schema
StructType(List(StructField(Payload,ArrayType(MapType(StringType,StringType,true),true),true)))
>>> df.select(explode("Payload").alias("x")).select("x.ActiveDate", "x.BusinessName", "x.BusinessId").show()
+----------+------------+----------+
|ActiveDate|BusinessName|BusinessId|
+----------+------------+----------+
|2008-11-25| ACL| 5678|
|2009-03-22| BCL| 6789|
+----------+------------+----------+
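A sketch of how this fix might slot into the original pipeline, assuming the column, table, and Delta setup from the question are unchanged:
from pyspark.sql import functions as F

# Explode the already-parsed Payload array into one row per element,
# then pull the struct fields out as columns before writing.
flat_df = df.select(F.explode("Payload").alias("x")) \
            .select("x.ActiveDate", "x.BusinessId", "x.BusinessName")
flat_df.write.format("delta").mode("overwrite").saveAsTable("delta_payload")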

How can I LEFT JOIN two arrays of structs within a single row?

I'm working with "Action Breakdown" data extracted from Facebook's Ads Insights API.
Facebook doesn't put the action (# of purchases) and the action_value ($ amount of purchase) in the same column, so I need to JOIN those on my end, based on the identifier of the action (id# + device type in my case).
If each action were simply its own row, it would of course be trivial to JOIN them with SQL. But in this case, I need to JOIN the two structs within each row. What I'm looking to do amounts to a LEFT JOIN across two structs, matched on two columns. Ideally I could do this with SQL alone (not PySpark/Scala/etc).
So far I have tried:
The SparkSQL inline generator. This gives me each action on its own row, but since the parent row in the original dataset doesn't have a unique identifier, there isn't a way to JOIN these structs on a per-row basis. I also tried using inline() on both columns, but only one "generator" function can be used at a time.
Using SparkSQL arrays_zip function to combine them. But this doesn't work because the order isn't always the same and they sometimes don't have the same keys.
I considered writing a map function in PySpark. But it seems map functions only identify columns by index and not name, which seems fragile if the columns should change later on (likely when working with 3rd party APIs).
I considered writing a PySpark UDF, which seems like the best option, but requires a permission I do not have (SELECT on anonymous function). If that's truly the best option, I'll try to push for that permission.
To better illustrate: Each row in my dataset has an actions and action_values column with data like this.
actions = [
  {
    "action_device": "desktop",
    "action_type": "offsite_conversion.custom.123",
    "value": "1"
  },
  {
    "action_device": "desktop", /* Same conversion ID; different device. */
    "action_type": "offsite_conversion.custom.321",
    "value": "1"
  },
  {
    "action_device": "iphone", /* Same conversion ID; different device. */
    "action_type": "offsite_conversion.custom.321",
    "value": "2"
  },
  {
    "action_device": "iphone", /* Has "actions" but no "action_values" entry. */
    "action_type": "offsite_conversion.custom.789",
    "value": "1"
  }
]
action_values = [
  {
    "action_device": "desktop",
    "action_type": "offsite_conversion.custom.123",
    "value": "49.99"
  },
  {
    "action_device": "desktop",
    "action_type": "offsite_conversion.custom.321",
    "value": "19.99"
  },
  {
    "action_device": "iphone",
    "action_type": "offsite_conversion.custom.321",
    "value": "99.99"
  }
]
I would like each row to have both datapoints in a single struct, like this:
my_desired_result = [
  {
    "action_device": "desktop",
    "action_type": "offsite_conversion.custom.123",
    "count": "1",    /* This comes from the "action" struct */
    "value": "49.99" /* This comes from the "action_values" struct */
  },
  {
    "action_device": "desktop",
    "action_type": "offsite_conversion.custom.321",
    "count": "1",
    "value": "19.99"
  },
  {
    "action_device": "iphone",
    "action_type": "offsite_conversion.custom.321",
    "count": "2",
    "value": "99.99"
  },
  {
    "action_device": "iphone",
    "action_type": "offsite_conversion.custom.789",
    "count": "1",
    "value": null    /* NULL because there is no value for conversion#789 AND iphone */
  }
]
IIUC, you can try transform and then use filter to find the first matched item from action_values by matching action_device and action_type:
df.printSchema()
root
|-- action_values: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- action_device: string (nullable = true)
| | |-- action_type: string (nullable = true)
| | |-- value: string (nullable = true)
|-- actions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- action_device: string (nullable = true)
| | |-- action_type: string (nullable = true)
| | |-- value: string (nullable = true)
df.createOrReplaceTempView("df_table")
spark.sql("""
SELECT
transform(actions, x -> named_struct(
'action_device', x.action_device,
'action_type', x.action_type,
'count', x.value,
'value', filter(action_values, y -> y.action_device = x.action_device AND y.action_type = x.action_type)[0].value
)) as result
FROM df_table
""").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[desktop, offsite_conversion.custom.123, 1, 49.99], [desktop, offsite_conversion.custom.321, 1, 19.99], [iphone, offsite_conversion.custom.321, 2, 99.99], [iphone, offsite_conversion.custom.789, 1,]]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
UPDATE: in case of the FULL JOIN, you can try the following SQL:
spark.sql("""
SELECT
concat(
/* actions left join action_values with potentially multiple matched values */
flatten(
transform(actions, x ->
transform(
filter(action_values, y -> y.action_device = x.action_device AND y.action_type = x.action_type),
z -> named_struct(
'action_device', x.action_device,
'action_type', x.action_type,
'count', x.value,
'value', z.value
)
)
)
),
/* action_values missing from actions */
transform(
filter(action_values, x -> !exists(actions, y -> x.action_device = y.action_device AND x.action_type = y.action_type)),
z -> named_struct(
'action_device', z.action_device,
'action_type', z.action_type,
'count', NULL,
'value', z.value
)
)
) as result
FROM df_table
""").show(truncate=False)

How to "where" based on the last StructType of a list

Suppose I have a DataFrame with a column named 'arr' that is a list of StructType values, which can be described by the following JSON:
{
  "otherAttribute": "blabla...",
  "arr": [
    {
      "domain": "books",
      "others": "blabla..."
    },
    {
      "domain": "music",
      "others": "blabla..."
    }
  ]
}
{
  "otherAttribute": "blabla...",
  "arr": [
    {
      "domain": "music",
      "others": "blabla..."
    },
    {
      "domain": "furniture",
      "others": "blabla..."
    }
  ]
}
... ...
We want to keep only the records where the last StructType in "arr" has its "domain" attribute equal to "music". In the above example, we need to keep the first record but discard the second. I need help writing such a "where" clause.
The answer is based on this data:
+---------------+----------------------------------------------+
|other_attribute|arr |
+---------------+----------------------------------------------+
|first |[[books, ...], [music, ...]] |
|second |[[books, ...], [music, ...], [furniture, ...]]|
|third |[[football, ...], [soccer, ...]] |
+---------------+----------------------------------------------+
arr here is an array of structs.
Each element of arr has attributes domain and others (filled with ... here).
DataFrame API approach (F is pyspark.sql.functions):
df.filter(
F.col("arr")[F.size(F.col("arr")) - 1]["domain"] == "music"
)
The SQL way:
SELECT
other_attribute,
arr
FROM df
WHERE arr[size(arr) - 1]['domain'] = 'music'
The output table will look like this:
+---------------+----------------------------+
|other_attribute|arr |
+---------------+----------------------------+
|first |[[books, ...], [music, ...]]|
+---------------+----------------------------+
Full code (suggested for running in the PySpark console):
import pyspark.sql.types as T
import pyspark.sql.functions as F
schema = T.StructType()\
.add("other_attribute", T.StringType())\
.add("arr", T.ArrayType(
T.StructType()
.add("domain", T.StringType())
.add("others", T.StringType())
)
)
df = spark.createDataFrame([
["first", [["books", "..."], ["music", "..."]]],
["second", [["books", "..."], ["music", "..."], ["furniture", "..."]]],
["third", [["football", "..."], ["soccer", "..."]]]
], schema)
filtered = df.filter(
F.col("arr")[F.size(F.col("arr")) - 1]["domain"] == "music"
)
filtered.show(100, False)
df.createOrReplaceTempView("df")
filtered_with_sql = spark.sql("""
SELECT
other_attribute,
arr
FROM df
WHERE arr[size(arr) - 1]['domain'] = 'music'
""")
filtered_with_sql.show(100, False)
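A hedged alternative to the size(arr) - 1 indexing, assuming Spark 2.4+: element_at with a negative index reads from the end of the array, which expresses "the last struct" directly.
import pyspark.sql.functions as F  # already imported as F in the full code above

# Same filter, but element_at(arr, -1) picks the last array element.
filtered_alt = df.filter(F.element_at("arr", -1)["domain"] == "music")
filtered_alt.show(100, False)
The SQL form would be WHERE element_at(arr, -1)['domain'] = 'music'.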

Parse nested JSON structure with a varying schema with Spark DataFrame or RDD API

I have many JSON documents with a structure like this:
{
  "parent_id": "parent_id1",
  "devices": "HERE_IS_STRUCT_SERIALIZED_AS_STRING_SEE_BELOW"
}
where the serialized "devices" string looks like this:
{
  "0x0034" : { "id": "0x0034", "p1": "p1v1", "p2": "p2v1" },
  "0xAB34" : { "id": "0xAB34", "p1": "p1v2", "p2": "p2v2" },
  "0xCC34" : { "id": "0xCC34", "p1": "p1v3", "p2": "p2v3" },
  "0xFFFF" : { "id": "0xFFFF", "p1": "p1v4", "p2": "p2v4" },
  ....
  "0x0023" : { "id": "0x0023", "p1": "p1vN", "p2": "p2vN" }
}
As you can see, instead of making an array of objects, the telemetry developers serialize every element as a property of an object, and the property names vary depending on the id.
Using the Spark DataFrame or RDD API, I want to transform it into a table like this:
parent_id1, 0x0034, p1v1, p2v1
parent_id1, 0xAB34, p1v2, p2v2
parent_id1, 0xCC34, p1v3, p2v3
parent_id1, 0xFFFF, p1v4, p2v4
parent_id1, 0x0023, p1v5, p2v5
Here is sample data:
{
  "parent_id": "parent_id1",
  "devices" : "{ \"0x0034\" : { \"id\": \"0x0034\", \"p1\": \"p1v1\", \"p2\": \"p2v1\" }, \"0xAB34\" : { \"id\": \"0xAB34\", \"p1\": \"p1v2\", \"p2\": \"p2v2\" }, \"0xCC34\" : { \"id\": \"0xCC34\", \"p1\": \"p1v3\", \"p2\": \"p2v3\" }, \"0xFFFF\" : { \"id\": \"0xFFFF\", \"p1\": \"p1v4\", \"p2\": \"p2v4\" }, \"0x0023\" : { \"id\": \"0x0023\", \"p1\": \"p1vN\", \"p2\": \"p2vN\" }}"
}
{
  "parent_id": "parent_id2",
  "devices" : "{ \"0x0045\" : { \"id\": \"0x0045\", \"p1\": \"p1v1\", \"p2\": \"p2v1\" }, \"0xC5C1\" : { \"id\": \"0xC5C1\", \"p1\": \"p1v2\", \"p2\": \"p2v2\" }}"
}
Desired output
parent_id1, 0x0034, p1v1, p2v1
parent_id1, 0xAB34, p1v2, p2v2
parent_id1, 0xCC34, p1v3, p2v3
parent_id1, 0xFFFF, p1v4, p2v4
parent_id1, 0x0023, p1v5, p2v5
parent_id2, 0x0045, p1v1, p2v1
parent_id2, 0xC5C1, p1v2, p2v2
I thought about passing devices as a parameter to the from_json function and then somehow transforming the returned object into a JSON array and exploding it...
But from_json wants a schema as input, and the schema tends to vary...
There is probably a more Pythonic or Spark-native way to do this, but this worked for me:
Input Data
data = {
"parent_id": "parent_v1",
"devices" : "{ \"0x0034\" : { \"id\": \"0x0034\", \"p1\": \"p1v1\", \"p2\": \"p2v1\" }, \"0xAB34\" : { \"id\": \"0xAB34\", \"p1\": \"p1v2\", \"p2\": \"p2v2\" }, \"0xCC34\" : { \"id\": \"0xCC34\", \"p1\": \"p1v3\", \"p2\": \"p2v3\" }, \"0xFFFF\" : { \"id\": \"0xFFFF\", \"p1\": \"p1v4\", \"p2\": \"p2v4\" }, \"0x0023\" : { \"id\": \"0x0023\", \"p1\": \"p1vN\", \"p2\": \"p2vN\" }}"
}
Get DataFrame
import json

def get_df_from_json(json_data):
    # convert string to json
    json_data['devices'] = json.loads(json_data['devices'])
    list_of_dicts = []
    for device_name, device_details in json_data['devices'].items():
        row = {
            "parent_id": json_data['parent_id'],
            "device": device_name
        }
        for key in device_details.keys():
            row[key] = device_details[key]
        list_of_dicts.append(row)
    return spark.read.json(sc.parallelize(list_of_dicts), multiLine=True)

display(get_df_from_json(data))
Output
+--------+--------+------+------+-----------+
| device | id | p1 | p2 | parent_id |
+--------+--------+------+------+-----------+
| 0x0034 | 0x0034 | p1v1 | p2v1 | parent_v1 |
| 0x0023 | 0x0023 | p1vN | p2vN | parent_v1 |
| 0xFFFF | 0xFFFF | p1v4 | p2v4 | parent_v1 |
| 0xCC34 | 0xCC34 | p1v3 | p2v3 | parent_v1 |
| 0xAB34 | 0xAB34 | p1v2 | p2v2 | parent_v1 |
+--------+--------+------+------+-----------+
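A hedged, Spark-native alternative to the pure-Python approach above (a sketch, not the answer's method, assuming the same data dictionary and Spark 2.4+): since only the keys of the serialized object vary while every value has the same id/p1/p2 shape, the devices string can be parsed with from_json into a map type and then exploded.
import json
import pyspark.sql.functions as F

# Keys vary, but the value struct is fixed, so a map<string, struct<...>> schema works.
devices_schema = "map<string, struct<id:string, p1:string, p2:string>>"

df = spark.read.json(sc.parallelize([json.dumps(data)]))
result = df.select(
    "parent_id",
    F.explode(F.from_json("devices", devices_schema)).alias("device", "props")
).select("parent_id", "device", "props.p1", "props.p2")
result.show()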
