PySpark: transform key-value pairs into a row

[{
key=Username,
value=
},
{
key=Email,
value=
},
{
key=Id,
value=
},
{
key=Organization,
value=
},
{
key=Role,
value=
},
{
key=Address,
value=
},
{
key=Component,
value=
},
{
key=Reason,
value=
},
{
key=Region,
value=
}]
root
|-- headers: array
| |-- element: struct
| | |-- key: string
| | |-- value: string
|-- id: string
My JSON data looks like the above. I need to transform it into a single row, with each key as a column name and each value as the data.
The result needs to have the following columns:
Username, Email, Id, Organization, Role, Address, Component, Reason, Region
The schema is shown above.
I tried explode, which converts each key-value pair into its own row, so the data looks like this:
+----------------+----+
|headers         |id  |
+----------------+----+
|{Username, }    |8239|
|{Email, }       |8239|
|{Id, }          |8239|
|{Organization, }|8239|
|{Role, []}      |8239|
|{Address, 9}    |8239|
|{Component, 9}  |8239|
|{Reason, 9}     |8239|
|{Region, 9999}  |8239|
+----------------+----+

You can use the inline_outer function to flatten the array<struct> column into key and value columns, then apply groupby and pivot to get the desired result.
from pyspark.sql.functions import first
df.selectExpr("id", "inline_outer(headers)").\
groupby("id").\
pivot("key").\
agg(first("value")).\
show(truncate=False)
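To make the reshaping explicit, here is a minimal plain-Python model of what inline_outer followed by groupby, pivot and first computes. It is a sketch of the semantics rather than Spark code; pivot_headers and the sample values are hypothetical, since the values in the question are blank.

```python
from collections import defaultdict


def pivot_headers(records):
    """Model of inline_outer + groupby("id") + pivot("key") + first("value").

    records: list of dicts shaped like the schema above, for example
    {"id": "8239", "headers": [{"key": "Username", "value": "..."}]}.
    """
    rows = defaultdict(dict)
    for record in records:
        row = rows[record["id"]]  # one output row per id
        for header in record.get("headers") or []:
            # first("value"): keep the first value seen for each key
            row.setdefault(header["key"], header["value"])
    return [{"id": row_id, **columns} for row_id, columns in rows.items()]


# Hypothetical sample values; the values in the question are blank.
sample = [{
    "id": "8239",
    "headers": [
        {"key": "Username", "value": "alice"},
        {"key": "Email", "value": "alice@example.com"},
        {"key": "Region", "value": "9999"},
    ],
}]
print(pivot_headers(sample))
```

One difference from Spark worth keeping in mind: pivot produces the same set of columns for every id and fills gaps with null, whereas this model simply omits keys that an id does not have.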

Related

RedisJSON add TTL on specific Value in JSON Object

I am working with RedisJSON from Node.js, using the npm package redis 4.3.1.
Keys have the form (userID):(Country), and each value is a JSON document.
Example:
data = {
"info": {
"name":"test",
"email": "test#test,test"
},
"suppliers": {
"s1": 1,
"s2": 22
},
"suppliersCap": {
"s1": 0,
"s2": 10
}
}
redis.json.set('22:AU', '.', data);
Now I am trying to add a TTL of 5 minutes to one specific key inside the JSON,
for example this one:
22:AU .data.suppliersCap.s2, so that after 5 minutes the cap becomes 0.
But this does not work:
redis.json.set('22:AU', '.data.suppliersCap.s2', {
EX: 300
});
You cannot set TTL on an inner element of a RedisJSON object.
Note: It can be done only on an entire RedisJSON object.
NX and XX are the only valid modifiers for the command:
redisClient.json.set(
key: string,
path: string,
json: RedisJSON,
options?: NX | XX | undefined
): Promise<"OK" | null>

How to (dynamically) join array with a struct, to get a value from the struct for each element in the array?

I am trying to parse/flatten JSON data that contains an array and a struct.
For every "id" in the "data_array" column, I need to get the "EstValue" from the "data_struct" column. Each field name in "data_struct" is the actual id (from "data_array"). I tried to use a dynamic join, but I get the error "Column is not iterable". Can't we use dynamic join conditions in PySpark, like we can in SQL? Is there a better way to achieve this?
JSON Input file:
{
"data_array": [
{
"id": 1,
"name": "ABC"
},
{
"id": 2,
"name": "DEF"
}
],
"data_struct": {
"1": {
"estimated": {
"value": 123
},
"completed": {
"value": 1234
}
},
"2": {
"estimated": {
"value": 456
},
"completed": {
"value": 4567
}
}
}
}
Desired output:
Id Name EstValue CompValue
1 ABC 123 1234
2 DEF 456 4567
My PySpark code:
from pyspark.sql.functions import *
rawDF = spark.read.json([f"abfss://{pADLSContainer}@{pADLSGen2}.dfs.core.windows.net/{pADLSDirectory}/InputFile.json"], multiLine = "true")
idDF = rawDF.select(explode("data_array").alias("data_array")) \
.select(col("data_array.id").alias("id"))
idDF.show(n=2,vertical=True,truncate=150)
finalDF = idDF.join(rawDF, (idDF.id == rawDF.select(col("data_struct." + idDF.Id))) )
finalDF.show(n=2,vertical=True,truncate=150)
Error:
def __iter__(self): raise TypeError("Column is not iterable")
Self joins create problems. In this case, you can avoid the join.
You could make arrays from both columns, zip them together, and use inline to extract the result into columns. The most difficult part is creating an array from the "data_struct" column. There may be a better way, but the only approach I could think of was to first transform it into a map type.
Input:
s = """
{
"data_array": [
{
"id": 1,
"name": "ABC"
},
{
"id": 2,
"name": "DEF"
}
],
"data_struct": {
"1": {
"estimated": {
"value": 123
},
"completed": {
"value": 1234
}
},
"2": {
"estimated": {
"value": 456
},
"completed": {
"value": 4567
}
}
}
}
"""
rawDF = spark.read.json(sc.parallelize([s]), multiLine = "true")
Script:
from pyspark.sql import functions as F
id = F.transform('data_array', lambda x: x.id).alias('Id')
name = F.transform('data_array', lambda x: x['name']).alias('Name')
map = F.from_json(F.to_json("data_struct"), 'map<string, struct<estimated:struct<value:long>,completed:struct<value:long>>>')
est_val = F.transform(id, lambda x: map[x].estimated.value).alias('EstValue')
comp_val = F.transform(id, lambda x: map[x].completed.value).alias('CompValue')
df = rawDF.withColumn('y', F.arrays_zip(id, name, est_val, comp_val))
df = df.selectExpr("inline(y)")
df.show()
# +---+----+--------+---------+
# | Id|Name|EstValue|CompValue|
# +---+----+--------+---------+
# | 1| ABC| 123| 1234|
# | 2| DEF| 456| 4567|
# +---+----+--------+---------+
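The lookup that the map-based script performs can be modeled in plain Python, which may help when checking the expected output on a small sample. This is a sketch of the logic only, not Spark code; flatten_estimates is a hypothetical name, and the input is the JSON from the question.

```python
def flatten_estimates(doc):
    """Return one dict per data_array element, with values looked up by id."""
    rows = []
    for item in doc["data_array"]:
        # data_struct is keyed by the id as a string ("1", "2", ...)
        entry = doc["data_struct"].get(str(item["id"])) or {}
        rows.append({
            "Id": item["id"],
            "Name": item["name"],
            "EstValue": (entry.get("estimated") or {}).get("value"),
            "CompValue": (entry.get("completed") or {}).get("value"),
        })
    return rows


doc = {
    "data_array": [{"id": 1, "name": "ABC"}, {"id": 2, "name": "DEF"}],
    "data_struct": {
        "1": {"estimated": {"value": 123}, "completed": {"value": 1234}},
        "2": {"estimated": {"value": 456}, "completed": {"value": 4567}},
    },
}
for row in flatten_estimates(doc):
    print(row)
```

An id with no entry in data_struct yields None for both values in this model, which mirrors a missing map key rather than raising an error.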

pyspark transform json array into multiple columns

I'm using the code below to read data from an API whose payload is in JSON format, using PySpark in Azure Databricks. All the fields are defined as strings, but I keep running into the error "json_tuple requires that all arguments are strings".
Schema:
root
|-- Payload: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ActiveDate: string (nullable = true)
| | |-- BusinessId: string (nullable = true)
| | |-- BusinessName: string (nullable = true)
JSON:
{
"Payload":
[
{
"ActiveDate": "2008-11-25",
"BusinessId": "5678",
"BusinessName": "ACL"
},
{
"ActiveDate": "2009-03-22",
"BusinessId": "6789",
"BusinessName": "BCL"
}
]
}
PySpark:
from pyspark.sql import functions as F
df = df.select(F.col('Payload'), F.json_tuple(F.col('Payload'), 'ActiveDate', 'BusinessId', 'BusinessName').alias('ActiveDate', 'BusinessId', 'BusinessName'))
df.write.format("delta").mode("overwrite").saveAsTable("delta_payload")
Error:
AnalysisException: cannot resolve 'json_tuple(`Payload`, 'ActiveDate', 'BusinessId', 'BusinessName')' due to data type mismatch: json_tuple requires that all arguments are strings;
From your schema it looks like the JSON has already been parsed, so Payload is an ArrayType column rather than a StringType column containing JSON text, hence the error.
You probably need explode instead of json_tuple:
>>> from pyspark.sql.functions import explode
>>> df = spark.createDataFrame([{
... "Payload":
... [
... {
... "ActiveDate": "2008-11-25",
... "BusinessId": "5678",
... "BusinessName": "ACL"
... },
... {
... "ActiveDate": "2009-03-22",
... "BusinessId": "6789",
... "BusinessName": "BCL"
... }
... ]
... }])
>>> df.schema
StructType(List(StructField(Payload,ArrayType(MapType(StringType,StringType,true),true),true)))
>>> df.select(explode("Payload").alias("x")).select("x.ActiveDate", "x.BusinessName", "x.BusinessId").show()
+----------+------------+----------+
|ActiveDate|BusinessName|BusinessId|
+----------+------------+----------+
|2008-11-25| ACL| 5678|
|2009-03-22| BCL| 6789|
+----------+------------+----------+
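The distinction between JSON text and an already-parsed array can be illustrated with a small plain-Python model of the explode step. This is a sketch of the idea rather than Spark code; explode_payload is a hypothetical helper, and the record is the JSON from the question.

```python
import json


def explode_payload(record, fields=("ActiveDate", "BusinessName", "BusinessId")):
    """Return one dict per Payload element, keeping only the named fields."""
    payload = record["Payload"]
    if isinstance(payload, str):
        # JSON text has to be parsed first; an already-parsed list does not.
        payload = json.loads(payload)
    # explode: one output row per array element
    return [{field: element.get(field) for field in fields} for element in payload or []]


record = {
    "Payload": [
        {"ActiveDate": "2008-11-25", "BusinessId": "5678", "BusinessName": "ACL"},
        {"ActiveDate": "2009-03-22", "BusinessId": "6789", "BusinessName": "BCL"},
    ]
}
for row in explode_payload(record):
    print(row)
```

The isinstance check is the part that corresponds to the error above: string functions apply only while the payload is still text.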

How can I LEFT JOIN two arrays of structs within a single row?

I'm working with "Action Breakdown" data extracted from Facebook's Ads Insights API.
Facebook doesn't put the action (# of purchases) and the action_value ($ amount of purchase) in the same column, so I need to JOIN those on my end, based on the identifier of the action (id# + device type in my case).
If each action were simply its own row, it would of course be trivial to JOIN them with SQL. But in this case, I need to JOIN the two structs within each row. What I'm looking to do amounts to a LEFT JOIN across two structs, matched on two columns. Ideally I could do this with SQL alone (not PySpark/Scala/etc).
So far I have tried:
The SparkSQL inline generator. This gives me each action on its own row, but since the parent row in the original dataset doesn't have a unique identifier, there isn't a way to JOIN these structs on a per-row basis. Also tried using inline() on both columns, but only 1 "generator" function can be used at a time.
Using SparkSQL arrays_zip function to combine them. But this doesn't work because the order isn't always the same and they sometimes don't have the same keys.
I considered writing a map function in PySpark. But it seems map functions only identify columns by index and not name, which seems fragile if the columns should change later on (likely when working with 3rd party APIs).
I considered writing a PySpark UDF, which seems like the best option, but requires a permission I do not have (SELECT on anonymous function). If that's truly the best option, I'll try to push for that permission.
To better illustrate: Each row in my dataset has an actions and action_values column with data like this.
actions = [
{
"action_device": "desktop",
"action_type": "offsite_conversion.custom.123",
"value": "1"
},
{
"action_device": "desktop", /* Same conversion ID; different device. */
"action_type": "offsite_conversion.custom.321",
"value": "1"
},
{
"action_device": "iphone", /* Same conversion ID; different device. */
"action_type": "offsite_conversion.custom.321",
"value": "2"
},
{
"action_device": "iphone", /* has "actions" but not "action_values" */
"action_type": "offsite_conversion.custom.789",
"value": "1"
}
]
action_values = [
{
"action_device": "desktop",
"action_type": "offsite_conversion.custom.123",
"value": "49.99"
},
{
"action_device": "desktop",
"action_type": "offsite_conversion.custom.321",
"value": "19.99"
},
{
"action_device": "iphone",
"action_type": "offsite_conversion.custom.321",
"value": "99.99"
}
]
I would like each row to have both datapoints in a single struct, like this:
my_desired_result = [
{
"action_device": "desktop",
"action_type": "offsite_conversion.custom.123",
"count": "1", /* This comes from the "action" struct */
"value": "49.99" /* This comes from the "action_values" struct */
},
{
"action_device": "desktop",
"action_type": "offsite_conversion.custom.321",
"count": "1",
"value": "19.99"
},
{
"action_device": "iphone",
"action_type": "offsite_conversion.custom.321",
"count": "2",
"value": "99.99"
},
{
"action_device": "iphone",
"action_type": "offsite_conversion.custom.789",
"count": "1",
"value": null /* NULL because there is no value for conversion#789 AND iphone */
}
]
IIUC, you can try transform and then use filter to find the first matched item from action_values by matching action_device and action_type:
df.printSchema()
root
|-- action_values: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- action_device: string (nullable = true)
| | |-- action_type: string (nullable = true)
| | |-- value: string (nullable = true)
|-- actions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- action_device: string (nullable = true)
| | |-- action_type: string (nullable = true)
| | |-- value: string (nullable = true)
df.createOrReplaceTempView("df_table")
spark.sql("""
SELECT
transform(actions, x -> named_struct(
'action_device', x.action_device,
'action_type', x.action_type,
'count', x.value,
'value', filter(action_values, y -> y.action_device = x.action_device AND y.action_type = x.action_type)[0].value
)) as result
FROM df_table
""").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[desktop, offsite_conversion.custom.123, 1, 49.99], [desktop, offsite_conversion.custom.321, 1, 19.99], [iphone, offsite_conversion.custom.321, 2, 99.99], [iphone, offsite_conversion.custom.789, 1,]]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
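For checking the result on a small sample, the transform/filter expression can be modeled in plain Python. This is a sketch of the semantics rather than Spark code; left_join_actions is a hypothetical name, and the data is taken from the question.

```python
def left_join_actions(actions, action_values):
    """For each action, attach the value of the first matching action_values entry."""
    result = []
    for action in actions:
        # filter(action_values, y -> same device AND same type)
        matches = [
            v for v in action_values
            if v["action_device"] == action["action_device"]
            and v["action_type"] == action["action_type"]
        ]
        result.append({
            "action_device": action["action_device"],
            "action_type": action["action_type"],
            "count": action["value"],
            # [0].value on an empty match list becomes None here
            "value": matches[0]["value"] if matches else None,
        })
    return result


actions = [
    {"action_device": "desktop", "action_type": "offsite_conversion.custom.123", "value": "1"},
    {"action_device": "desktop", "action_type": "offsite_conversion.custom.321", "value": "1"},
    {"action_device": "iphone", "action_type": "offsite_conversion.custom.321", "value": "2"},
    {"action_device": "iphone", "action_type": "offsite_conversion.custom.789", "value": "1"},
]
action_values = [
    {"action_device": "desktop", "action_type": "offsite_conversion.custom.123", "value": "49.99"},
    {"action_device": "desktop", "action_type": "offsite_conversion.custom.321", "value": "19.99"},
    {"action_device": "iphone", "action_type": "offsite_conversion.custom.321", "value": "99.99"},
]
for row in left_join_actions(actions, action_values):
    print(row)
```

Because only the first match is used, duplicate keys in action_values are silently ignored in this model, as they are in the SQL above.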
UPDATE: in case of the FULL JOIN, you can try the following SQL:
spark.sql("""
SELECT
concat(
/* actions joined to action_values, one struct per match; an action with no match yields an empty array and is dropped by flatten */
flatten(
transform(actions, x ->
transform(
filter(action_values, y -> y.action_device = x.action_device AND y.action_type = x.action_type),
z -> named_struct(
'action_device', x.action_device,
'action_type', x.action_type,
'count', x.value,
'value', z.value
)
)
)
),
/* action_values missing from actions */
transform(
filter(action_values, x -> !exists(actions, y -> x.action_device = y.action_device AND x.action_type = y.action_type)),
z -> named_struct(
'action_device', z.action_device,
'action_type', z.action_type,
'count', NULL,
'value', z.value
)
)
) as result
FROM df_table
""").show(truncate=False)

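As a reference point for checking the SQL output, here is a plain-Python model of full-outer-join semantics over the two arrays. It is a sketch, not Spark code; join_key and full_join_actions are hypothetical names. In this model an action with no matching action_values entry is kept with a null value, which is worth verifying against what the SQL returns for your data.

```python
def join_key(item):
    """The two fields the arrays are matched on."""
    return (item["action_device"], item["action_type"])


def full_join_actions(actions, action_values):
    result = []
    for action in actions:
        matches = [v for v in action_values if join_key(v) == join_key(action)]
        # One output entry per match; keep the action with value None if none match.
        values = [m["value"] for m in matches] or [None]
        for value in values:
            result.append({
                "action_device": action["action_device"],
                "action_type": action["action_type"],
                "count": action["value"],
                "value": value,
            })
    # action_values entries that have no corresponding action
    action_keys = {join_key(a) for a in actions}
    for v in action_values:
        if join_key(v) not in action_keys:
            result.append({
                "action_device": v["action_device"],
                "action_type": v["action_type"],
                "count": None,
                "value": v["value"],
            })
    return result
```

Comparing this model's output with the SQL result on a row that has unmatched entries on both sides is a quick way to confirm that neither side is being lost.
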
How to extract complex JSON structures using Apache Spark 1.4.0 Data Frames

I am using the new DataFrame API in Apache Spark 1.4.0 to extract information from Twitter's Status JSON, mostly focused on the Entities object. The part relevant to this question is shown below:
{
...
...
"entities": {
"hashtags": [],
"trends": [],
"urls": [],
"user_mentions": [
{
"screen_name": "linobocchini",
"name": "Lino Bocchini",
"id": 187356243,
"id_str": "187356243",
"indices": [ 3, 16 ]
},
{
"screen_name": "jeanwyllys_real",
"name": "Jean Wyllys",
"id": 111123176,
"id_str": "111123176",
"indices": [ 79, 95 ]
}
],
"symbols": []
},
...
...
}
There are several examples of how to extract information from primitive types such as string, integer, etc., but I couldn't find anything on how to process this kind of complex structure.
I tried the code below, but it still doesn't work; it throws an exception.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val tweets = sqlContext.read.json("tweets.json")
// this function just filters out empty entities.user_mentions[] nodes
// some tweets don't contain any mentions
import org.apache.spark.sql.functions.udf
val isEmpty = udf((value: List[Any]) => value.isEmpty)
import org.apache.spark.sql._
import sqlContext.implicits._
case class UserMention(id: Long, idStr: String, indices: Array[Long], name: String, screenName: String)
val mentions = tweets.select("entities.user_mentions").
filter(!isEmpty($"user_mentions")).
explode($"user_mentions") {
case Row(arr: Array[Row]) => arr.map { elem =>
UserMention(
elem.getAs[Long]("id"),
elem.getAs[String]("id_str"),
elem.getAs[Array[Long]]("indices"),
elem.getAs[String]("name"),
elem.getAs[String]("screen_name"))
}
}
mentions.first
Exception when I try to call mentions.first:
scala> mentions.first
15/06/23 22:15:06 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 8)
scala.MatchError: [List([187356243,187356243,List(3, 16),Lino Bocchini,linobocchini], [111123176,111123176,List(79, 95),Jean Wyllys,jeanwyllys_real])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:34)
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:34)
at scala.Function1$$anonfun$andThen$1.apply(Function1.scala:55)
at org.apache.spark.sql.catalyst.expressions.UserDefinedGenerator.eval(generators.scala:81)
What is wrong here? I understand it is related to the types, but I couldn't figure it out yet.
As additional context, the structure mapped automatically is:
scala> mentions.printSchema
root
|-- user_mentions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- id_str: string (nullable = true)
| | |-- indices: array (nullable = true)
| | | |-- element: long (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- screen_name: string (nullable = true)
NOTE 1: I know it is possible to solve this using HiveQL, but I would like to use DataFrames since there is so much momentum around them.
SELECT explode(entities.user_mentions) as mentions
FROM tweets
NOTE 2: the UDF val isEmpty = udf((value: List[Any]) => value.isEmpty) is an ugly hack and I'm probably missing something here, but it was the only way I could come up with to avoid an NPE.
Here is a solution that works, with just one small hack.
The main idea is to work around the type problem by declaring a List[String] rather than List[Row]:
val mentions = tweets.explode("entities.user_mentions", "mention"){m: List[String] => m}
This creates a second column called "mention" of type "Struct":
| entities| mention|
+--------------------+--------------------+
|[List(),List(),Li...|[187356243,187356...|
|[List(),List(),Li...|[111123176,111123...|
Now do a map() to extract the fields inside mention. The getStruct(1) call gets the value in column 1 of each row:
case class Mention(id: Long, id_str: String, indices: Seq[Int], name: String, screen_name: String)
val mentionsRdd = mentions.map(
row =>
{
val mention = row.getStruct(1)
Mention(mention.getLong(0), mention.getString(1), mention.getSeq[Int](2), mention.getString(3), mention.getString(4))
}
)
And convert the RDD back into a DataFrame:
val mentionsDf = mentionsRdd.toDF()
There you go!
| id| id_str| indices| name| screen_name|
+---------+---------+------------+-------------+---------------+
|187356243|187356243| List(3, 16)|Lino Bocchini| linobocchini|
|111123176|111123176|List(79, 95)| Jean Wyllys|jeanwyllys_real|
Try doing this:
case Row(arr: Seq[Row]) => arr.map { elem =>
