structured streaming - explode json fields into dynamic columns? - apache-spark

I got this dataframe from a Kafka source.
+-----------------------+
| data |
+-----------------------+
| '{ "a": 1, "b": 2 }' |
+-----------------------+
| '{ "b": 3, "d": 4 }' |
+-----------------------+
| '{ "a": 2, "c": 4 }' |
+-----------------------+
I want to transform this into the following data frame:
+---------------------------+
| a | b | c | d |
+---------------------------+
| 1 | 2 | null | null |
+---------------------------+
| null | 3 | null | 4 |
+---------------------------+
| 2 | null | 4 | null |
+---------------------------+
The number of JSON fields may change, so I can't specify a schema for it.
I pretty much have an idea of how to do this transformation in Spark batch: use map and reduce to collect the set of JSON keys, then construct a new dataframe using withColumn.
However, as far as I've explored, there is no map/reduce facility in Structured Streaming. How do I achieve this?
UPDATE
I figured out that a UDF can be used to parse the string into JSON fields:
import simplejson as json
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType

def convert_json(s):
    return json.loads(s)

udf_convert_json = udf(convert_json, StructType(<..some schema here..>))
df = df.withColumn('parsed_data', udf_convert_json(df.data))
However, since the schema is dynamic, I need to collect all JSON keys and values present in df.data over a certain window period in order to construct the StructType used as the UDF return type.
In the end, I guess I need to know how to perform a reduce over the dataset for a certain window period and then use the result as a lookup schema in the stream transformation.

If you already know all the unique keys in your JSON data, then you can use the json_tuple function:
>>> df.show()
+------------------+
| data|
+------------------+
|{ "a": 1, "b": 2 }|
|{ "b": 3, "d": 4 }|
|{ "a": 2, "c": 4 }|
+------------------+
>>> from pyspark.sql import functions as F
>>> df.select(F.json_tuple(df.data,'a','b','c','d')).show()
+----+----+----+----+
| c0| c1| c2| c3|
+----+----+----+----+
| 1| 2|null|null|
|null| 3|null| 4|
| 2|null| 4|null|
+----+----+----+----+
Or, if you want proper column names, use from_json with an explicit schema:
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField("a", StringType()),StructField("b", StringType()),StructField("c",StringType()),StructField("d", StringType())])
>>> df.select(F.from_json(df.data,schema).alias('data')).select(F.col('data.*')).show()
+----+----+----+----+
| a| b| c| d|
+----+----+----+----+
| 1| 2|null|null|
|null| 3|null| 4|
| 2|null| 4|null|
+----+----+----+----+
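If the keys are not known in advance (the streaming case in the question), one possible direction, not covered by the answer above and assuming Spark 2.4+ for foreachBatch, is to infer the schema per micro-batch and then flatten with from_json. A sketch, where stream_df is the streaming DataFrame from Kafka with the string column data, and the output path is hypothetical:
from pyspark.sql import functions as F

def process_batch(batch_df, batch_id):
    # infer the schema from the JSON strings of this micro-batch only
    inferred = spark.read.json(batch_df.rdd.map(lambda row: row.data))
    parsed = (batch_df
              .select(F.from_json('data', inferred.schema).alias('data'))
              .select('data.*'))
    # hypothetical sink; note the set of columns can differ between batches
    parsed.write.mode('append').json('/tmp/output')

query = (stream_df.writeStream
         .foreachBatch(process_batch)
         .start())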

When you have a dynamic JSON column inside your PySpark DataFrame, you can use the code below to explode its fields into columns:
from pyspark.sql.functions import col, from_json

# make sure the column holds valid JSON strings
df2 = df.withColumn('columnx', udf_transform_tojsonstring(df.columnx))
# read the column as JSON to infer its schema
columnx_jsonDF = spark.read.json(df2.rdd.map(lambda row: row.columnx)).drop('_corrupt_record')
# parse the column with the inferred schema
df3 = df2.withColumn('columnx', from_json(col('columnx'), columnx_jsonDF.schema))
# add one column per JSON key
for c in set(columnx_jsonDF.columns):
    df3 = df3.withColumn(f'columnx_{c}', col(f'columnx.`{c}`'))
Explanation:
First we use a UDF to transform the column into a valid JSON string (if it isn't one already).
Then we read that column as a JSON DataFrame, letting Spark infer the schema.
Then we parse columnx again with the from_json() function, passing it the schema of columnx_jsonDF.
Finally we add a column to the main DataFrame for each key inside the JSON column.
This works when we don't know the JSON fields in advance and still need to explode them into columns.
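Note that udf_transform_tojsonstring is not defined above; it is whatever turns your column into a JSON string. A minimal sketch, assuming the column arrives as a Python dict (e.g. a MapType column):
import json
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(StringType())
def udf_transform_tojsonstring(value):
    # pass strings through unchanged, serialize everything else
    if isinstance(value, str):
        return value
    return json.dumps(value)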

I guess you don't need to do much. For example:
Bala:~:$ cat myjson.json
{ "a": 1, "b": 2 }
{ "b": 3, "d": 4 }
{ "a": 2, "c": 4 }
>>> df = sqlContext.sql("select * from json.`/Users/Bala/myjson.json`")
>>> df.show()
+----+----+----+----+
| a| b| c| d|
+----+----+----+----+
| 1| 2|null|null|
|null| 3|null| 4|
| 2|null| 4|null|
+----+----+----+----+

Related

Avoid writing of NULL fields present in pyspark dataframe

I have a Spark dataframe with the following entries:
column1 | column2
"a" | "b"
"x" | "c"
null | "a"
null | "b"
"x" | null
So when I convert it to a Glue dynamic frame and write it to an S3 bucket in JSON format, the null values are also written.
I don't want to convert the null fields to an empty string or a number. Basically, if a field value is null it should not be written at all. How can I avoid writing the null fields?
You can do something like .na.fill('') to default your values to an empty string:
df = spark.createDataFrame([("a",), ("b",), ("c",), (None,)], ['col'])
df.show()
+----+
| col|
+----+
| a|
| b|
| c|
|null|
+----+
df.na.fill('').show()
+---+
|col|
+---+
| a|
| b|
| c|
| |
+---+
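If the goal is really to not write null fields at all (rather than fill them), note that Spark's own JSON writer drops null-valued fields by default (in Spark 3.0+ this is governed by the ignoreNullFields option). A sketch that bypasses the Glue DynamicFrame writer; the S3 path is hypothetical:
(df.write
   .mode('overwrite')
   .json('s3://my-bucket/output/'))  # null-valued fields are omitted from the JSON by default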

How to remove several rows in a Spark Dataframe based on the position (not value)?

I want to do some data preprocessing using pyspark and want to remove data at the beginning and end of a dataframe. Let's say I want the first 30% and the last 30% of the data removed. I only found possibilities based on values using where, and for finding the first and the last row, but not for removing several rows. Here is the basic example so far, with no solution:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("foo").getOrCreate()
cut_factor_start = 0.3 # factor to cut the beginning of the data
cut_factor_stop = 1-cut_factor_start # factor to cut the end of the data
# create pandas dataframe
df = pd.DataFrame({'part':['foo','foo','foo','foo','foo', 'foo'], 'values':[9,1,2,2,6,9]})
# convert to spark dataframe
df = spark.createDataFrame(df)
df.show()
+----+------+
|part|values|
+----+------+
| foo| 9|
| foo| 1|
| foo| 2|
| foo| 2|
| foo| 6|
| foo| 9|
+----+------+
df_length = df.count()
print('length of df: ' + str(df_length))
cut_start = round(df_length * cut_factor_start)
print('start position to cut: ' + str(cut_start))
cut_stop = round(df_length * (cut_factor_stop))
print('stop position to cut: ' + str(cut_stop))
length of df: 6
start position to cut: 2
stop position to cut: 4
What I want, based on the calculations:
+----+------+
|part|values|
+----+------+
| foo| 1|
| foo| 2|
| foo| 2|
+----+------+
Another way is to use between after assigning a row_number:
import pyspark.sql.functions as F
from pyspark.sql import Window
rnum= F.row_number().over(Window.orderBy(F.lit(0)))
output = (df.withColumn('Rnum',rnum)
.filter(F.col("Rnum").between(cut_start, cut_stop)).drop('Rnum'))
output.show()
+----+------+
|part|values|
+----+------+
| foo| 1|
| foo| 2|
| foo| 2|
+----+------+
In Scala, a unique "id" column can be added, and then the "limit" and "except" functions used:
import org.apache.spark.sql.functions.monotonically_increasing_id

val dfWithIds = df.withColumn("uniqueId", monotonically_increasing_id())
dfWithIds
  .limit(stopPositionToCut)
  .except(dfWithIds.limit(startPositionToCut - 1))
  .drop("uniqueId")
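The same idea in PySpark, as a sketch assuming Spark 2.4+ for exceptAll (the unique id prevents genuinely duplicate rows from being collapsed):
from pyspark.sql.functions import monotonically_increasing_id

df_ids = df.withColumn('uniqueId', monotonically_increasing_id())
output = (df_ids.limit(cut_stop)                    # rows 1..cut_stop
          .exceptAll(df_ids.limit(cut_start - 1))   # minus rows 1..cut_start-1
          .drop('uniqueId'))
output.show()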

pyspark - attempting to create new column based on the difference of two ArrayType columns

I have a table like so:
+-----+----+-------+-------+
|name | id | msg_a | msg_b |
+-----+----+-------+-------+
| a| 3|[a,b,c]|[c] |
| b| 5|[x,y,z]|[h,x,z]|
| c| 7|[a,x,y]|[j,x,y]|
+-----+----+-------+-------+
I want to add a column so that anything in msg_b but not in msg_a is surfaced.
E.g.
+-----+----+-------+-------+------------+
|name | id | msg_a | msg_b | difference |
+-----+----+-------+-------+------------+
| a| 3|[a,b,c]|[c] |NA |
| b| 5|[x,y,z]|[h,x,z]|[h] |
| c| 7|[a,x,y]|[j,x,y]|[j] |
+-----+----+-------+-------+------------+
Referring to a previous post, I've tried
df.select('msg_b').subtract(df.select('msg_a')).show()
which works, but I need the information as a table, with name and id
Doing this:
df.withColumn("difference", F.col('msg_b').subtract(F.col('msg_a'))).show(5)
yields a TypeError: 'Column' object is not callable.
Not sure if there is a separate function for performing this operation, if I'm missing something glaringly obvious, etc.
You have to use a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

@udf(ArrayType(StringType()))
def subtract(xs, ys):
    return list(set(xs) - set(ys))
Example:
df = sc.parallelize([
    (["a", "b", "c"], ["c"]), (["x", "y", "z"], ["h", "x", "z"])
]).toDF(["msg_a", "msg_b"])
df.select(subtract('msg_b', 'msg_a')).show()
+----------------------+
|subtract(msg_b, msg_a)|
+----------------------+
| []|
| [h]|
+----------------------+
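On Spark 2.4+ the built-in array_except does the same thing without a UDF (a sketch, not part of the original answer), and it keeps name and id because withColumn preserves the other columns:
from pyspark.sql import functions as F

df.withColumn('difference', F.array_except('msg_b', 'msg_a')).show()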

combining multiple rows in Spark dataframe column based on condition

I am trying to combine multiple rows in a spark dataframe based on a condition:
This is the dataframe I have (df):
|username | qid | row_no | text |
---------------------------------
| a | 1 | 1 | this |
| a | 1 | 2 | is |
| d | 2 | 1 | the |
| a | 1 | 3 | text |
| d | 2 | 2 | ball |
I want it to look like this
|username | qid | row_no | text |
---------------------------------------
| a | 1 | 1,2,3 | This is text|
| d | 2 | 1,2 | The ball |
I am using Spark 1.5.2, which does not have the collect_list function.
collect_list showed up only in 1.6.
I'd go through the underlying RDD. Here's how:
data_df.show()
+--------+---+------+----+
|username|qid|row_no|text|
+--------+---+------+----+
| d| 2| 2|ball|
| a| 1| 1|this|
| a| 1| 3|text|
| a| 1| 2| is|
| d| 2| 1| the|
+--------+---+------+----+
Then this:
import pyspark.sql.types as typ

reduced = data_df\
    .rdd\
    .map(lambda row: ((row[0], row[1]), [(row[2], row[3])]))\
    .reduceByKey(lambda x, y: x + y)\
    .map(lambda row: (row[0], sorted(row[1], key=lambda text: text[0])))\
    .map(lambda row: (
        row[0][0],
        row[0][1],
        ','.join([str(e[0]) for e in row[1]]),
        ' '.join([str(e[1]) for e in row[1]])
    ))

schema_red = typ.StructType([
    typ.StructField('username', typ.StringType(), False),
    typ.StructField('qid', typ.IntegerType(), False),
    typ.StructField('row_no', typ.StringType(), False),
    typ.StructField('text', typ.StringType(), False)
])

df_red = sqlContext.createDataFrame(reduced, schema_red)
df_red.show()
The above produced the following:
+--------+---+------+------------+
|username|qid|row_no| text|
+--------+---+------+------------+
| d| 2| 1,2| the ball|
| a| 1| 1,2,3|this is text|
+--------+---+------+------------+
In pandas:
df4 = pd.DataFrame([
['a', 1, 1, 'this'],
['a', 1, 2, 'is'],
['d', 2, 1, 'the'],
['a', 1, 3, 'text'],
['d', 2, 2, 'ball']
], columns=['username', 'qid', 'row_no', 'text'])
df_grouped = df4.sort_values(by=['qid', 'row_no']).groupby(['username', 'qid'])
df3 = pd.DataFrame()
df3['row_no'] = df_grouped.apply(lambda g: ','.join([str(e) for e in g['row_no']]))
df3['text'] = df_grouped.apply(lambda g: ' '.join(g['text']))
df3 = df3.reset_index()
You can apply groupBy on the username and qid columns, and then with the agg() method you can use collect_list(), like this:
import pyspark.sql.functions as func
Then you will have collect_list() and the other aggregate functions available.
For details about groupBy and agg, see the Spark documentation.
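A sketch of that groupBy/agg approach (it needs Spark 1.6+ for collect_list, which the asker's 1.5.2 lacks, and the ordering inside collect_list is not strictly guaranteed):
import pyspark.sql.functions as func

result = (df.orderBy('row_no')   # best-effort ordering; not guaranteed after groupBy
          .groupBy('username', 'qid')
          .agg(func.concat_ws(',', func.collect_list(func.col('row_no').cast('string'))).alias('row_no'),
               func.concat_ws(' ', func.collect_list('text')).alias('text')))
result.show()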
Hope this solves your problem
Thanks

Aggregating List of Dicts in Spark DataFrame

How can I perform aggregations and analysis on a column in a Spark DataFrame that was created from a column containing multiple dictionaries, such as the one below:
rootKey=[Row(key1='value1', key2='value2', key3='value3'), Row(key1='value1', key2='value2', key3='value3'), Row(key1='value1', key2='value2', key3='value3'), Row(key1='value1', key2='value2', key3='value3')]
Here is an example of what the column looks like:
>>> df.select('column').show(20, False)
+-----------------------------------------------------------------+
|column |
+-----------------------------------------------------------------+
|[[1,1,1], [1,2,6], [1,2,13], [1,3,3]] |
|[[2,1,1], [2,3,6], [2,4,10]] |
|[[1,1,1], [1,1,6], [1,2,1], [2,2,2], [2,3,6], [1,3,7], [2,4,10]] |
An example would be to summarize all of the key values and groupBy a different column.
You need f.explode:
json_file.json:
{"idx":1, "col":[{"k":1,"v1":1,"v2":1},{"k":1,"v1":2,"v2":6},{"k":1,"v1":2,"v2":13},{"k":1,"v1":2,"v2":2}]}
{"idx":2, "col":[{"k":2,"v1":1,"v2":1},{"k":2,"v1":3,"v2":6},{"k":2,"v1":4,"v2":10}]}
from pyspark.sql import functions as f
df = spark.read.load('file:///home/zht/PycharmProjects/test/json_file.json', format='json')
df = df.withColumn('col', f.explode(df['col']))
df = df.groupBy(df['col']['v1']).sum('col.k')
df.show()
# output:
+---------+-----------------+
|col['v1']|sum(col.k AS `k`)|
+---------+-----------------+
| 1| 3|
| 3| 2|
| 2| 3|
| 4| 2|
+---------+-----------------+
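If the data is already in a DataFrame column (as in the question) rather than in a file, the same explode and groupBy pattern applies directly. A sketch, assuming the array column is called col and its struct fields are k, v1 and v2 as above:
from pyspark.sql import functions as f

exploded = df.withColumn('col', f.explode(df['col']))
exploded.groupBy(exploded['col']['v1']).sum('col.k').show()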
