structured streaming - explode json fields into dynamic columns? - apache-spark

I got this dataframe from a Kafka source.
+-----------------------+
| data |
+-----------------------+
| '{ "a": 1, "b": 2 }' |
+-----------------------+
| '{ "b": 3, "d": 4 }' |
+-----------------------+
| '{ "a": 2, "c": 4 }' |
+-----------------------+
I want to transform this into the following data frame:
+---------------------------+
| a | b | c | d |
+---------------------------+
| 1 | 2 | null | null |
+---------------------------+
| null | 3 | null | 4 |
+---------------------------+
| 2 | null | 4 | null |
+---------------------------+
The number of JSON fields may change, so I can't specify a schema for it.
I pretty much have an idea of how to do this transformation in Spark batch: use map and reduce to collect the set of JSON keys, then construct a new dataframe using withColumn.
However, as far as I've explored, there is no map/reduce facility in Structured Streaming. How do I achieve this?
UPDATE
I figured out that a UDF can be used to parse the string into JSON fields:
import simplejson as json
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType

def convert_json(s):
    return json.loads(s)

udf_convert_json = udf(convert_json, StructType(<..some schema here..>))
df = df.withColumn('parsed_data', udf_convert_json(df.data))
However, since the schema is dynamic, I need to collect all JSON keys and values present in df.data over a certain window period in order to construct the StructType used as the UDF return type.
In the end, I guess I need to know how to perform a reduce over the dataset for a certain window period and then use the result as a lookup schema in the stream transformation.

If you already know all the unique keys in your JSON data, then you can use the json_tuple function:
>>> df.show()
+------------------+
| data|
+------------------+
|{ "a": 1, "b": 2 }|
|{ "b": 3, "d": 4 }|
|{ "a": 2, "c": 4 }|
+------------------+
>>> from pyspark.sql import functions as F
>>> df.select(F.json_tuple(df.data,'a','b','c','d')).show()
+----+----+----+----+
| c0| c1| c2| c3|
+----+----+----+----+
| 1| 2|null|null|
|null| 3|null| 4|
| 2|null| 4|null|
+----+----+----+----+
Or, if you want proper column names, use from_json with an explicit schema:
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField("a", StringType()),StructField("b", StringType()),StructField("c",StringType()),StructField("d", StringType())])
>>> df.select(F.from_json(df.data,schema).alias('data')).select(F.col('data.*')).show()
+----+----+----+----+
| a| b| c| d|
+----+----+----+----+
| 1| 2|null|null|
|null| 3|null| 4|
| 2|null| 4|null|
+----+----+----+----+
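If the keys are not known in advance (the streaming case in the question), one possible direction, not covered by the answer above and assuming Spark 2.4+ for foreachBatch, is to infer the schema per micro-batch and then flatten with from_json. A sketch, where stream_df is the streaming DataFrame from Kafka with the string column data, and the output path is hypothetical:
from pyspark.sql import functions as F

def process_batch(batch_df, batch_id):
    # infer the schema from the JSON strings of this micro-batch only
    inferred = spark.read.json(batch_df.rdd.map(lambda row: row.data))
    parsed = (batch_df
              .select(F.from_json('data', inferred.schema).alias('data'))
              .select('data.*'))
    # hypothetical sink; note the set of columns can differ between batches
    parsed.write.mode('append').json('/tmp/output')

query = (stream_df.writeStream
         .foreachBatch(process_batch)
         .start())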

When you have a dynamic JSON column inside your PySpark DataFrame, you can use the code below to explode its fields into columns:
from pyspark.sql.functions import col, from_json

# make sure the column holds valid JSON strings
df2 = df.withColumn('columnx', udf_transform_tojsonstring(df.columnx))
# read the column as JSON to infer its schema
columnx_jsonDF = spark.read.json(df2.rdd.map(lambda row: row.columnx)).drop('_corrupt_record')
# parse the column with the inferred schema
df3 = df2.withColumn('columnx', from_json(col('columnx'), columnx_jsonDF.schema))
# add one column per JSON key
for c in set(columnx_jsonDF.columns):
    df3 = df3.withColumn(f'columnx_{c}', col(f'columnx.`{c}`'))
Explanation:
First we use a UDF to transform the column into a valid JSON string (if it isn't one already).
Then we read that column as a JSON DataFrame, letting Spark infer the schema.
Then we parse columnx again with the from_json() function, passing it the schema of columnx_jsonDF.
Finally we add a column to the main DataFrame for each key inside the JSON column.
This works when we don't know the JSON fields in advance and still need to explode them into columns.
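Note that udf_transform_tojsonstring is not defined above; it is whatever turns your column into a JSON string. A minimal sketch, assuming the column arrives as a Python dict (e.g. a MapType column):
import json
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(StringType())
def udf_transform_tojsonstring(value):
    # pass strings through unchanged, serialize everything else
    if isinstance(value, str):
        return value
    return json.dumps(value)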

I guess you don't need to do much. For example:
Bala:~:$ cat myjson.json
{ "a": 1, "b": 2 }
{ "b": 3, "d": 4 }
{ "a": 2, "c": 4 }
>>> df = sqlContext.sql("select * from json.`/Users/Bala/myjson.json`")
>>> df.show()
+----+----+----+----+
| a| b| c| d|
+----+----+----+----+
| 1| 2|null|null|
|null| 3|null| 4|
| 2|null| 4|null|
+----+----+----+----+

Related

Avoid writing of NULL fields present in pyspark dataframe

I have a Spark dataframe with the following entries:
column1 | column2
"a" | "b"
"x" | "c"
null | "a"
null | "b"
"x" | null
So when I convert it to a Glue dynamic frame and write it to an S3 bucket in JSON format, the null values are also written.
I don't want to convert the null fields to an empty string or a number. Basically, if a field value is null it should not be written at all. How can I avoid writing the null fields?
You can do something like .na.fill('') to default your values to an empty string:
df = spark.createDataFrame([("a",), ("b",), ("c",), (None,)], ['col'])
df.show()
+----+
| col|
+----+
| a|
| b|
| c|
|null|
+----+
df.na.fill('').show()
+---+
|col|
+---+
| a|
| b|
| c|
| |
+---+
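If the goal is really to not write null fields at all (rather than fill them), note that Spark's own JSON writer drops null-valued fields by default (in Spark 3.0+ this is governed by the ignoreNullFields option). A sketch that bypasses the Glue DynamicFrame writer; the S3 path is hypothetical:
(df.write
   .mode('overwrite')
   .json('s3://my-bucket/output/'))  # null-valued fields are omitted from the JSON by default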

How to remove several rows in a Spark Dataframe based on the position (not value)?

I want to do some data preprocessing using pyspark and want to remove data at the beginning and end of a dataframe. Let's say I want the first 30% and the last 30% of the data removed. I only found possibilities based on values using where, and for finding the first and the last row, but not for removing several rows. Here is the basic example so far, with no solution:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("foo").getOrCreate()
cut_factor_start = 0.3 # factor to cut the beginning of the data
cut_factor_stop = 1-cut_factor_start # factor to cut the end of the data
# create pandas dataframe
df = pd.DataFrame({'part':['foo','foo','foo','foo','foo', 'foo'], 'values':[9,1,2,2,6,9]})
# convert to spark dataframe
df = spark.createDataFrame(df)
df.show()
+----+------+
|part|values|
+----+------+
| foo| 9|
| foo| 1|
| foo| 2|
| foo| 2|
| foo| 6|
| foo| 9|
+----+------+
df_length = df.count()
print('length of df: ' + str(df_length))
cut_start = round(df_length * cut_factor_start)
print('start position to cut: ' + str(cut_start))
cut_stop = round(df_length * (cut_factor_stop))
print('stop position to cut: ' + str(cut_stop))
length of df: 6
start position to cut: 2
stop position to cut: 4
What I want, based on the calculations:
+----+------+
|part|values|
+----+------+
| foo| 1|
| foo| 2|
| foo| 2|
+----+------+
Another way is to use between after assigning a row_number:
import pyspark.sql.functions as F
from pyspark.sql import Window
rnum= F.row_number().over(Window.orderBy(F.lit(0)))
output = (df.withColumn('Rnum',rnum)
.filter(F.col("Rnum").between(cut_start, cut_stop)).drop('Rnum'))
output.show()
+----+------+
|part|values|
+----+------+
| foo| 1|
| foo| 2|
| foo| 2|
+----+------+
In Scala, a unique "id" column can be added, and then the "limit" and "except" functions used:
import org.apache.spark.sql.functions.monotonically_increasing_id

val dfWithIds = df.withColumn("uniqueId", monotonically_increasing_id())
dfWithIds
  .limit(stopPositionToCut)
  .except(dfWithIds.limit(startPositionToCut - 1))
  .drop("uniqueId")
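The same idea in PySpark, as a sketch assuming Spark 2.4+ for exceptAll (the unique id prevents genuinely duplicate rows from being collapsed):
from pyspark.sql.functions import monotonically_increasing_id

df_ids = df.withColumn('uniqueId', monotonically_increasing_id())
output = (df_ids.limit(cut_stop)                    # rows 1..cut_stop
          .exceptAll(df_ids.limit(cut_start - 1))   # minus rows 1..cut_start-1
          .drop('uniqueId'))
output.show()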

pyspark - attempting to create new column based on the difference of two ArrayType columns

I have a table like so:
+-----+----+-------+-------+
|name | id | msg_a | msg_b |
+-----+----+-------+-------+
| a| 3|[a,b,c]|[c] |
| b| 5|[x,y,z]|[h,x,z]|
| c| 7|[a,x,y]|[j,x,y]|
+-----+----+-------+-------+
I want to add a column so that anything in msg_b but not in msg_a is surfaced.
E.g.
+-----+----+-------+-------+------------+
|name | id | msg_a | msg_b | difference |
+-----+----+-------+-------+------------+
| a| 3|[a,b,c]|[c] |NA |
| b| 5|[x,y,z]|[h,x,z]|[h] |
| c| 7|[a,x,y]|[j,x,y]|[j] |
+-----+----+-------+-------+------------+
Referring to a previous post, I've tried
df.select('msg_b').subtract(df.select('msg_a')).show()
which works, but I need the information as a table, with name and id
Doing this:
df.withColumn("difference", F.col('msg_b').subtract(F.col('msg_a'))).show(5)
yields a TypeError: 'Column' object is not callable.
Not sure if there is a separate function for performing this operation, if I'm missing something glaringly obvious, etc.
You have to use a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

@udf(ArrayType(StringType()))
def subtract(xs, ys):
    return list(set(xs) - set(ys))
Example:
df = sc.parallelize([
    (["a", "b", "c"], ["c"]), (["x", "y", "z"], ["h", "x", "z"])
]).toDF(["msg_a", "msg_b"])
df.select(subtract('msg_b', 'msg_a')).show()
+----------------------+
|subtract(msg_b, msg_a)|
+----------------------+
| []|
| [h]|
+----------------------+
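On Spark 2.4+ the built-in array_except does the same thing without a UDF (a sketch, not part of the original answer), and it keeps name and id because withColumn preserves the other columns:
from pyspark.sql import functions as F

df.withColumn('difference', F.array_except('msg_b', 'msg_a')).show()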

combining multiple rows in Spark dataframe column based on condition

I am trying to combine multiple rows in a spark dataframe based on a condition:
This is the dataframe I have (df):
|username | qid | row_no | text |
---------------------------------
| a | 1 | 1 | this |
| a | 1 | 2 | is |
| d | 2 | 1 | the |
| a | 1 | 3 | text |
| d | 2 | 2 | ball |
I want it to look like this
|username | qid | row_no | text |
---------------------------------------
| a | 1 | 1,2,3 | This is text|
| d | 2 | 1,2 | The ball |
I am using Spark 1.5.2, which does not have the collect_list function.
collect_list showed up only in 1.6.
I'd go through the underlying RDD. Here's how:
data_df.show()
+--------+---+------+----+
|username|qid|row_no|text|
+--------+---+------+----+
| d| 2| 2|ball|
| a| 1| 1|this|
| a| 1| 3|text|
| a| 1| 2| is|
| d| 2| 1| the|
+--------+---+------+----+
Then this:
import pyspark.sql.types as typ

reduced = data_df\
    .rdd\
    .map(lambda row: ((row[0], row[1]), [(row[2], row[3])]))\
    .reduceByKey(lambda x, y: x + y)\
    .map(lambda row: (row[0], sorted(row[1], key=lambda text: text[0])))\
    .map(lambda row: (
        row[0][0],
        row[0][1],
        ','.join([str(e[0]) for e in row[1]]),
        ' '.join([str(e[1]) for e in row[1]])
    ))

schema_red = typ.StructType([
    typ.StructField('username', typ.StringType(), False),
    typ.StructField('qid', typ.IntegerType(), False),
    typ.StructField('row_no', typ.StringType(), False),
    typ.StructField('text', typ.StringType(), False)
])

df_red = sqlContext.createDataFrame(reduced, schema_red)
df_red.show()
The above produced the following:
+--------+---+------+------------+
|username|qid|row_no| text|
+--------+---+------+------------+
| d| 2| 1,2| the ball|
| a| 1| 1,2,3|this is text|
+--------+---+------+------------+
In pandas:
df4 = pd.DataFrame([
['a', 1, 1, 'this'],
['a', 1, 2, 'is'],
['d', 2, 1, 'the'],
['a', 1, 3, 'text'],
['d', 2, 2, 'ball']
], columns=['username', 'qid', 'row_no', 'text'])
df_grouped = df4.sort_values(by=['qid', 'row_no']).groupby(['username', 'qid'])
df3 = pd.DataFrame()
df3['row_no'] = df_grouped.apply(lambda g: ','.join([str(e) for e in g['row_no']]))
df3['text'] = df_grouped.apply(lambda g: ' '.join(g['text']))
df3 = df3.reset_index()
You can apply groupBy on the username and qid columns, and then with the agg() method you can use collect_list(), like this:
import pyspark.sql.functions as func
Then you will have collect_list() and the other aggregate functions available.
For details about groupBy and agg, see the Spark documentation.
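A sketch of that groupBy/agg approach (it needs Spark 1.6+ for collect_list, which the asker's 1.5.2 lacks, and the ordering inside collect_list is not strictly guaranteed):
import pyspark.sql.functions as func

result = (df.orderBy('row_no')   # best-effort ordering; not guaranteed after groupBy
          .groupBy('username', 'qid')
          .agg(func.concat_ws(',', func.collect_list(func.col('row_no').cast('string'))).alias('row_no'),
               func.concat_ws(' ', func.collect_list('text')).alias('text')))
result.show()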
Hope this solves your problem
Thanks

Aggregating List of Dicts in Spark DataFrame

How can I perform aggregations and analysis on a column in a Spark DataFrame that was created from a column containing multiple dictionaries, such as the one below:
rootKey=[Row(key1='value1', key2='value2', key3='value3'), Row(key1='value1', key2='value2', key3='value3'), Row(key1='value1', key2='value2', key3='value3'), Row(key1='value1', key2='value2', key3='value3')]
Here is an example of what the column looks like:
>>> df.select('column').show(20, False)
+-----------------------------------------------------------------+
|column |
+-----------------------------------------------------------------+
|[[1,1,1], [1,2,6], [1,2,13], [1,3,3]] |
|[[2,1,1], [2,3,6], [2,4,10]] |
|[[1,1,1], [1,1,6], [1,2,1], [2,2,2], [2,3,6], [1,3,7], [2,4,10]] |
An example would be to summarize all of the key values and groupBy a different column.
You need f.explode:
json_file.json:
{"idx":1, "col":[{"k":1,"v1":1,"v2":1},{"k":1,"v1":2,"v2":6},{"k":1,"v1":2,"v2":13},{"k":1,"v1":2,"v2":2}]}
{"idx":2, "col":[{"k":2,"v1":1,"v2":1},{"k":2,"v1":3,"v2":6},{"k":2,"v1":4,"v2":10}]}
from pyspark.sql import functions as f
df = spark.read.load('file:///home/zht/PycharmProjects/test/json_file.json', format='json')
df = df.withColumn('col', f.explode(df['col']))
df = df.groupBy(df['col']['v1']).sum('col.k')
df.show()
# output:
+---------+-----------------+
|col['v1']|sum(col.k AS `k`)|
+---------+-----------------+
| 1| 3|
| 3| 2|
| 2| 3|
| 4| 2|
+---------+-----------------+
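If the data is already in a DataFrame column (as in the question) rather than in a file, the same explode and groupBy pattern applies directly. A sketch, assuming the array column is called col and its struct fields are k, v1 and v2 as above:
from pyspark.sql import functions as f

exploded = df.withColumn('col', f.explode(df['col']))
exploded.groupBy(exploded['col']['v1']).sum('col.k').show()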
