Dealing with nested JSON in PySpark

I'm loading a JSON file into PySpark:
df = spark.read.json("20220824211022.json")
df.show()
+--------------------+--------------------+--------------------+
| data| includes| meta|
+--------------------+--------------------+--------------------+
|[{961778216070344...|{[{2018-02-09T01:...|{1562543391161741...|
+--------------------+--------------------+--------------------+
The two columns I'm interested in here are data and includes. For data, I ran the following:
df2 = df.withColumn("data", F.explode(F.col("data"))).select("data.*")
df2.show(2)
+-------------------+--------------------+-------------------+--------------+--------------------+
| author_id| created_at| id|public_metrics| text|
+-------------------+--------------------+-------------------+--------------+--------------------+
| 961778216070344705|2022-08-24T20:52:...|1562543391161741312| {0, 0, 0, 2}|With Kaskada, you...|
|1275784834321768451|2022-08-24T20:47:...|1562542031284555777| {2, 0, 0, 0}|Below is a protot...|
+-------------------+--------------------+-------------------+--------------+--------------------+
Which is something I can work with. However, I can't do the same with the includes column, as it has {} enclosing the [].
Is there a way for me to deal with this using PySpark?
EDIT:
If you were to look at the includes sections in the JSON file, it looks like:
"includes": {"users": [{"id": "893899303" .... }, ...]},
So ideally in the first table in my question, I'd want the includes to be users, or at least be able to drill down to users

As your includes column is a struct with a "users" field (Spark infers JSON objects as StructType), you can use .getItem() to get the array by the field name and then explode it, that is:
df3 = df.withColumn("includes", F.explode(F.col("includes").getItem("users"))).select("includes.*")

Related

How to get the size of a list returned by column in pyspark

| name | contact | address |
| ---- | ------- | ------- |
| "max" | [{"email": "watson#commerce.gov", "phone": "650-333-3456"}, {"email": "emily#gmail.com", "phone": "238-111-7689"}] | {"city": "Baltimore", "state": "MD"} |
| "kyle" | [{"email": "johnsmith#yahoo.com", "phone": "425-231-8754"}] | {"city": "Barton", "state": "TN"} |
I am working with a dataframe in Pyspark that has a few columns including the two mentioned above. I need to create columns dynamically based on the contact fields.
When I use the "." operator on contact, as in contact.email, I get a list of emails. I need to create a separate column for each of the emails:
contact.email0, contact.email1, etc.
I found this code online, which partially does what I want, but I don't completely understand it.
employee_data.select(
    'name',
    *[col('contact.email')[i].alias(f'contact.email{i}') for i in range(2)]
).show(truncate=False)
The range is static in this case, but my range could be dynamic. How can I get the size of the list to loop through it? I tried size(col('contact.email')) and len(col('contact.email')), but got an error saying the Column object is not iterable.
Desired output, something like:

| name | contact.email0 | contact.email1 |
| ---- | -------------- | -------------- |
| max  | watson#commerce.gov | emily#gmail.com |
| kyle | johnsmith#yahoo.com | null |
You can get the desired output by using the pivot function:
from pyspark.sql.functions import posexplode_outer, expr, concat, lit, col, first

# convert the contact array of structs to an array of emails using transform
# posexplode the array to get one row per email plus its index (pos)
# pivot the generated column names into columns
df.select("name", posexplode_outer(expr("transform(contact, c -> c.email)"))) \
    .withColumn("email", concat(lit("contact.email"), col("pos"))) \
    .groupBy("name").pivot("email").agg(first("col")) \
    .show(truncate=False)
+----+-------------------+---------------+
|name|contact.email0 |contact.email1 |
+----+-------------------+---------------+
|kyle|johnsmith#yahoo.com|null |
|max |watson#commerce.gov|emily#gmail.com|
+----+-------------------+---------------+
To understand what the solution you found does, we can print the expression in a shell:
>>> [F.col('contact.email')[i].alias(f'contact.email{i}') for i in range(2)]
[Column<'contact.email[0] AS `contact.email0`'>, Column<'contact.email[1] AS `contact.email1`'>]
Basically, it creates two columns, one for the first element of the array contact.email and one for the second element. That's all there is to it.
SOLUTION 1
Keep this solution, but you need to find the max size of your array first:
max_size = df.select(F.max(F.size("contact"))).first()[0]
df.select(
    'name',
    *[F.col('contact')[i]['email'].alias(f'contact.email{i}') for i in range(max_size)]
).show(truncate=False)
SOLUTION 2
Use posexplode to generate one row per element of the array + a pos column containing the index of the email in the array. Then use a pivot to create the columns you want.
df.select('name', F.posexplode('contact.email').alias('pos', 'email'))\
.withColumn('pos', F.concat(F.lit('contact.email'), 'pos'))\
.groupBy('name')\
.pivot('pos')\
.agg(F.first('email'))\
.show()
Both solutions yield:
+----+-------------------+---------------+
|name|contact.email0 |contact.email1 |
+----+-------------------+---------------+
|max |watson#commerce.gov|emily#gmail.com|
|kyle|johnsmith#yahoo.com|null |
+----+-------------------+---------------+
You can use the size function to get the length of the list in the contact column, but note that size returns a Column, not a Python integer, so it cannot be passed to range directly (and there is no array_length function in pyspark.sql.functions). Collect the maximum size into a Python variable first, then use it in range to dynamically create columns for each email. Here's an example:
from pyspark.sql.functions import col, size, max as max_
contact_size = employee_data.select(max_(size(col('contact')))).first()[0]
employee_data.select(
    'name',
    *[col('contact')[i]['email'].alias(f'contact.email{i}') for i in range(contact_size)]
).show(truncate=False)

Use dataframe column value as input to select expression

I have a series of expressions used to map raw JSON data to normalized column data. I'm trying to think of a way to efficiently apply this to every row as there are multiple schemas to consider.
Right now, I have one massive CASE statement (built dynamically) that gets interpreted to SQL like this:
SELECT
  CASE
    WHEN schema = 'A' THEN CONCAT(get_json_object(payload, '$.FirstName'), ' ', get_json_object(payload, '$.LastName'))
    WHEN schema = 'B' THEN get_json_object(payload, '$.Name')
  END AS name,
  CASE
    WHEN schema = 'A' THEN get_json_object(payload, '$.Telephone')
    WHEN schema = 'B' THEN get_json_object(payload, '$.PhoneNumber')
  END AS phone_number
This works, I just worry about performance as the number of schemas and columns increases. I want to see if there's another way and here is my idea.
I have a DataFrame expressions_df of valid SparkSQL expressions.
| schema | column | column_expression |
| ------ | ------ | ----------------- |
| A | name | CONCAT(get_json_object(payload, '$.FirstName'), ' ', get_json_object(payload, '$.LastName')) |
| A | phone_number | get_json_object(payload, '$.Telephone') |
| B | name | get_json_object(payload, '$.Name') |
| B | phone_number | get_json_object(payload, '$.PhoneNumber') |
This DataFrame is used as a lookup table of sorts against a DataFrame raw_df:
| schema | payload |
| ------ | ------- |
| A | {"FirstName": "John", "LastName": "Doe", "Telephone": "123-456-7890"} |
| B | {"Name": "Jane Doe", "PhoneNumber": "123-567-1234"} |
I'd like to do something like this where column_expression is passed to F.expr and used to interpret the SQL and return the appropriate value.
from pyspark.sql import functions as F

(
    raw_df
    .join(expressions_df, 'schema')
    .select(
        F.expr(column_expression)
    )
    .dropDuplicates()
)
The desired end result would be something like this so that no matter what the original schema is, the data is transformed to the same standard using the expressions as shown in the SQL or expressions_df.
| name | phone_number |
| -------- | ------------ |
| John Doe | 123-456-7890 |
| Jane Doe | 123-567-1234 |
You can't directly use a DataFrame column value as an expression with the expr function. You'll have to collect all the expressions into a Python object in order to pass them as parameters to expr.
Here's one way to do it: the expressions are collected into a dict, then for each schema we apply a different select expression. Finally, union all the dataframes to get the desired output:
from collections import defaultdict
from functools import reduce
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

exprs = defaultdict(list)
for r in expressions_df.collect():
    exprs[r.schema].append(F.expr(r.column_expression).alias(r.column))

schemas = [r.schema for r in raw_df.select("schema").distinct().collect()]
final_df = reduce(DataFrame.union, [raw_df.filter(f"schema='{s}'").select(*exprs[s]) for s in schemas])
final_df.show()
#+--------+------------+
#| name|phone_number|
#+--------+------------+
#|Jane Doe|123-567-1234|
#|John Doe|123-456-7890|
#+--------+------------+

How to parse and explode a list of dictionaries stored as string in pyspark?

I have some data that is stored in CSV. Sample data is available here - https://github.com/PranayMehta/apache-spark/blob/master/data.csv
I read the data using pyspark
df = spark.read.csv("data.csv",header=True)
df.printSchema()
root
|-- freeform_text: string (nullable = true)
|-- entity_object: string (nullable = true)
>>> df.show(truncate=False)
+---------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|freeform_text |entity_object |
+---------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Grapes are good. Bananas are bad.|[{'name': 'Grapes', 'type': 'OTHER', 'salience': '0.8335162997245789', 'sentiment_score': '0.8999999761581421', 'sentiment_magnitude': '0.8999999761581421', 'metadata': {}, 'mentions': {'mention_text': 'Grapes', 'mention_type': 'COMMON'}}, {'name': 'Bananas', 'type': 'OTHER', 'salience': '0.16648370027542114', 'sentiment_score': '-0.8999999761581421', 'sentiment_magnitude': '0.8999999761581421', 'metadata': {}, 'mentions': {'mention_text': 'Bananas', 'mention_type': 'COMMON'}}]|
|the weather is not good today |[{'name': 'weather', 'type': 'OTHER', 'salience': '1.0', 'sentiment_score': '-0.800000011920929', 'sentiment_magnitude': '0.800000011920929', 'metadata': {}, 'mentions': {'mention_text': 'weather', 'mention_type': 'COMMON'}}] |
+---------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Now, I want to explode and parse the fields in the entity_object column in this dataframe. Here is some more know-how on what this column contains -
For every freeform_text stored in the Spark dataframe, I have written some logic to parse out the entities using Google's Natural Language API. These entities are stored as a LIST of DICTIONARIES when I do the computation using pandas. I then convert them to a string before storing them in the database.
This CSV is what I read into a Spark dataframe as 2 columns - freeform_text and entity_object.
The entity_object column as a string is actually a LIST of dictionaries. It can be imagined as LIST[DICT1, DICT2] and so on, so some entity_object rows may have 1 element and others may have more than 1, based on the number of entities in the output. For instance, the first row has 2 entities - grapes and bananas - whereas the 2nd row has only the entity weather.
I want to explode this entity_object column so that 1 record of freeform_text can be exploded in multiple records.
Here is a screenshot of how I would like my output to be -
This can be a working solution for you - please do let me know if this does not work -
Create the dataframe here (each row is the string form of a LIST of dictionaries, matching the CSV column):
df_new = spark.createDataFrame([
    str([
        {'name': 'Grapes', 'type': 'OTHER', 'salience': '0.8335162997245789', 'sentiment_score': '0.8999999761581421', 'sentiment_magnitude': '0.8999999761581421', 'metadata': {}, 'mentions': {'mention_text': 'Grapes', 'mention_type': 'COMMON'}},
        {'name': 'weather', 'type': 'OTHER', 'salience': '1.0', 'sentiment_score': '-0.800000011920929', 'sentiment_magnitude': '0.800000011920929', 'metadata': {}, 'mentions': {'mention_text': 'weather', 'mention_type': 'COMMON'}}
    ]),
    str([
        {'name': 'banana', 'type': 'OTHER', 'salience': '1.0', 'sentiment_score': '-0.800000011920929', 'sentiment_magnitude': '0.800000011920929', 'metadata': {}, 'mentions': {'mention_text': 'weather', 'mention_type': 'COMMON'}}
    ])
], T.StringType())
Logic here (assuming import pyspark.sql.functions as F and import pyspark.sql.types as T):
df = df_new.withColumn('col', F.from_json("value", T.ArrayType(T.StringType())))
df = df.withColumn('explode_col', F.explode("col"))
df = df.withColumn('col', F.from_json("explode_col", T.MapType(T.StringType(), T.StringType())))
df = df.withColumn("name", df.col.getItem("name")) \
    .withColumn("type", df.col.getItem("type")) \
    .withColumn("salience", df.col.getItem("salience")) \
    .withColumn("sentiment_score", df.col.getItem("sentiment_score")) \
    .withColumn("sentiment_magnitude", df.col.getItem("sentiment_magnitude")) \
    .withColumn("mentions", df.col.getItem("mentions"))
df.select("name", "type", "salience", "sentiment_score", "sentiment_magnitude", "mentions").show(truncate=False)
Output
+-------+-----+------------------+------------------+-------------------+--------------------------------------------------+
|name |type |salience |sentiment_score |sentiment_magnitude|mentions |
+-------+-----+------------------+------------------+-------------------+--------------------------------------------------+
|weather|OTHER|1.0 |-0.800000011920929|0.800000011920929 |{"mention_text":"weather","mention_type":"COMMON"}|
|Grapes |OTHER|0.8335162997245789|0.8999999761581421|0.8999999761581421 |{"mention_text":"Grapes","mention_type":"COMMON"} |
|banana |OTHER|1.0 |-0.800000011920929|0.800000011920929 |{"mention_text":"weather","mention_type":"COMMON"}|
+-------+-----+------------------+------------------+-------------------+--------------------------------------------------+
Update - Instead of createDataFrame - use spark.read.csv() as below
df_new = spark.read.csv("/FileStore/tables/data.csv", header=True)
df_new.show(truncate=False)
# Logic Here
df = df_new.withColumn('col', F.from_json("entity_object", T.ArrayType(T.StringType())))
df = df.withColumn('explode_col', F.explode("col"))
df = df.withColumn('col', F.from_json("explode_col", T.MapType(T.StringType(), T.StringType())))
df = df.withColumn("name", df.col.getItem("name")) \
    .withColumn("type", df.col.getItem("type")) \
    .withColumn("salience", df.col.getItem("salience")) \
    .withColumn("sentiment_score", df.col.getItem("sentiment_score")) \
    .withColumn("sentiment_magnitude", df.col.getItem("sentiment_magnitude")) \
    .withColumn("mentions", df.col.getItem("mentions"))
df.select("freeform_text", "name", "type","salience","sentiment_score","sentiment_magnitude","mentions").show(truncate=False)
+---------------------------------+-------+-----+-------------------+-------------------+-------------------+--------------------------------------------------+
|freeform_text |name |type |salience |sentiment_score |sentiment_magnitude|mentions |
+---------------------------------+-------+-----+-------------------+-------------------+-------------------+--------------------------------------------------+
|Grapes are good. Bananas are bad.|Grapes |OTHER|0.8335162997245789 |0.8999999761581421 |0.8999999761581421 |{"mention_text":"Grapes","mention_type":"COMMON"} |
|Grapes are good. Bananas are bad.|Bananas|OTHER|0.16648370027542114|-0.8999999761581421|0.8999999761581421 |{"mention_text":"Bananas","mention_type":"COMMON"}|
|the weather is not good today |weather|OTHER|1.0 |-0.800000011920929 |0.800000011920929 |{"mention_text":"weather","mention_type":"COMMON"}|
+---------------------------------+-------+-----+-------------------+-------------------+-------------------+--------------------------------------------------+
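One aside on the format: because the stored strings use single quotes, they are Python literals rather than strict JSON (Spark's parser happens to tolerate them, since allowSingleQuotes defaults to true for from_json and spark.read.json). In plain Python - for instance inside a UDF - the same string can be parsed safely with ast.literal_eval:

```python
import ast

# One entity_object cell from the dataframe above, stored as a string
entity_object = ("[{'name': 'weather', 'type': 'OTHER', 'salience': '1.0', "
                 "'sentiment_score': '-0.800000011920929', "
                 "'sentiment_magnitude': '0.800000011920929', 'metadata': {}, "
                 "'mentions': {'mention_text': 'weather', 'mention_type': 'COMMON'}}]")

# literal_eval parses Python literals without executing code (unlike eval)
entities = ast.literal_eval(entity_object)
for e in entities:
    print(e["name"], e["salience"])  # weather 1.0
```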

How do I programmatically parse a fixed-width text file in PySpark?

This post does a great job of showing how to parse a fixed width text file into a Spark dataframe with pyspark (pyspark parse text file).
I have several text files I want to parse, but they each have slightly different schemas. Rather than having to write out the same procedure for each one like the previous post suggests, I'd like to write a generic function that can parse a fixed width text file given the widths and column names.
I'm pretty new to pyspark so I'm not sure how to write a select statement where the number of columns, and their types is variable.
Any help would be appreciated!
Say we have a text file like the one in the example thread:
00101292017you1234
00201302017 me5678
in "/tmp/sample.txt". And a dictionary containing, for each file name, a list of columns and a list of widths:
schema_dict = {
    "sample": {
        "columns": ["id", "date", "string", "integer"],
        "width": [3, 8, 3, 4]
    }
}
We can load the dataframes and split them into columns iteratively, using:
import numpy as np

input_path = "/tmp/"
df_dict = dict()
for file in schema_dict.keys():
    df = spark.read.text(input_path + file + ".txt")
    # substr() is 1-based: cumulative sums of the widths give each column's start
    start_list = np.cumsum([1] + schema_dict[file]["width"]).tolist()[:-1]
    df_dict[file] = df.select(
        [
            df.value.substr(
                start_list[i],
                schema_dict[file]["width"][i]
            ).alias(schema_dict[file]["columns"][i]) for i in range(len(start_list))
        ]
    )
df_dict["sample"].show()
+---+--------+------+-------+
| id| date|string|integer|
+---+--------+------+-------+
|001|01292017| you| 1234|
|002|01302017| me| 5678|
+---+--------+------+-------+
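The np.cumsum step is the only numpy dependency; the same start-position arithmetic can be sketched with the standard library's itertools.accumulate (substr is 1-based, Python slices are 0-based):

```python
from itertools import accumulate

widths = [3, 8, 3, 4]
# Cumulative sums starting at 1 give the 1-based start of each field;
# drop the final total, which points past the last field
start_list = list(accumulate([1] + widths))[:-1]
print(start_list)  # [1, 4, 12, 15]

line = "00101292017you1234"
# Subtract 1 to convert the 1-based starts into Python slice offsets
fields = [line[s - 1 : s - 1 + w] for s, w in zip(start_list, widths)]
print(fields)  # ['001', '01292017', 'you', '1234']
```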

Store aggregate value of a PySpark dataframe column into a variable

I am working with PySpark dataframes here. "test1" is my PySpark dataframe and event_date is a TimestampType. When I try to get a distinct count of event_date, the result is an integer variable, but when I try to get the max of the same column, the result is a dataframe. I would like to understand which operations result in a dataframe and which in a variable. I would also like to know how to store the max of the event_date column as a variable.
Code that results in an integer type:
loop_cnt=test1.select('event_date').distinct().count()
type(loop_cnt)
Code that results in dataframe type:
last_processed_dt=test1.select([max('event_date')])
type(last_processed_dt)
Edited to add a reproducible example:
from datetime import datetime
from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField("event_date", TimestampType(), True)])
df = sqlContext.createDataFrame([(datetime(2015, 8, 10, 2, 44, 15),), (datetime(2015, 8, 10, 3, 44, 15),)], schema)
Code that returns a dataframe:
last_processed_dt=df.select([max('event_date')])
type(last_processed_dt)
Code that returns a variable:
loop_cnt=df.select('event_date').distinct().count()
type(loop_cnt)
You cannot directly access the values in a dataframe; collecting a dataframe returns Row objects. A Row, however, gives you the option to convert it into a Python dictionary. Go through the following example where I will calculate the average word count:
wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
wordCountsDF = wordsDF.groupBy(wordsDF['word']).count()
wordCountsDF.show()
Here are the word count results:
+--------+-----+
| word|count|
+--------+-----+
| cat| 2|
| rat| 2|
|elephant| 1|
+--------+-----+
Now I calculate the average of the count column by applying the collect() operation on it. Remember that collect() returns a list; here the list contains only one element.
averageCount = wordCountsDF.groupBy().avg('count').collect()
Result looks something like this.
[Row(avg(count)=1.6666666666666667)]
You cannot directly access the average value through a Python variable. You have to convert the Row into a dictionary to access it:
results = {}
for i in averageCount:
    results.update(i.asDict())
print(results)
Our final result looks like this:
{'avg(count)': 1.6666666666666667}
Finally, you can access the average value using:
print(results['avg(count)'])
1.6666666666666667
I'm pretty sure df.select([max('event_date')]) returns a DataFrame because there could be more than one row that has the max value in that column. In your particular use case no two rows may have the same value in that column, but it is easy to imagine a case where more than one row can have the same max event_date.
df.select('event_date').distinct().count() returns an integer because it is telling you how many distinct values there are in that particular column. It does NOT tell you which value is the largest.
If you want code to get the max event_date and store it as a variable, try the following (the aggregation already yields a single row, so .distinct() is unnecessary): max_date = df.select([max('event_date')]).collect()[0][0]
Using collect()
import pyspark.sql.functions as sf
distinct_count = df.agg(sf.countDistinct('column_name')).collect()[0][0]
Using first()
import pyspark.sql.functions as sf
distinct_count = df.agg(sf.countDistinct('column_name')).first()[0]
last_processed_dt=df.select([max('event_date')])
to get the max of date, we should try something like
last_processed_dt=df.select([max('event_date').alias("max_date")]).collect()[0]
last_processed_dt["max_date"]
Based on sujit's example: we can actually get the value from [Row(avg(count)=1.6666666666666667)] without iterating/looping, by using averageCount[0][0].
Note: we do not need the loop because the query returns only one value.
Try this - note that count() already returns a Python integer, so there is nothing left to collect():
loop_cnt = test1.select('event_date').distinct().count()
Hope this helps
What you can try is accessing the collect() function, for example to fill nulls in a column with its average:
trainDF.fillna({'Age': trainDF.select('Age').agg(avg('Age')).collect()[0][0]})
Alternatively, keep the count as an aggregation column (rather than calling .count(), which already returns an integer) and collect the value:
loop_cnt = test1.select(countDistinct('event_date')).collect()[0][0]
print(loop_cnt)
