Pyspark create dictionary within groupby - apache-spark

Is it possible in pyspark to create a dictionary within groupBy.agg()? Here is a toy example:
import pyspark
from pyspark.sql import Row
import pyspark.sql.functions as F
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
toy_data = spark.createDataFrame([
    Row(id=1, key='a', value="123"),
    Row(id=1, key='b', value="234"),
    Row(id=1, key='c', value="345"),
    Row(id=2, key='a', value="12"),
    Row(id=2, key='x', value="23"),
    Row(id=2, key='y', value="123")])
toy_data.show()
+---+---+-----+
| id|key|value|
+---+---+-----+
| 1| a| 123|
| 1| b| 234|
| 1| c| 345|
| 2| a| 12|
| 2| x| 23|
| 2| y| 123|
+---+---+-----+
and this is the expected output:
---+------------------------------------
id | key_value
---+------------------------------------
1 | {"a": "123", "b": "234", "c": "345"}
2 | {"a": "12", "x": "23", "y": "123"}
---+------------------------------------
======================================
I tried this, but it doesn't work:
toy_data.groupBy("id").agg(
    F.create_map(F.col("key"), F.col("value")).alias("key_value")
)
This yields the following error:
AnalysisException: u"expression '`key`' is neither present in the group by, nor is it an aggregate function....

The agg component has to contain an actual aggregate function. One way to approach this is to combine collect_list:
Aggregate function: returns a list of objects with duplicates.
struct:
Creates a new struct column.
and map_from_entries:
Collection function: Returns a map created from the given array of entries.
This is how you'd do that:
toy_data.groupBy("id").agg(
    F.map_from_entries(
        F.collect_list(
            F.struct("key", "value"))).alias("key_value")
).show(truncate=False)
+---+------------------------------+
|id |key_value |
+---+------------------------------+
|1 |[a -> 123, b -> 234, c -> 345]|
|2 |[a -> 12, x -> 23, y -> 123] |
+---+------------------------------+
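If you need the literal JSON-string column shown in the expected output rather than a MapType column, you could wrap the map in to_json. A minimal sketch, assuming Spark 2.4+ where to_json accepts map columns:
import pyspark.sql.functions as F

# same aggregation as above, with the resulting map serialized to a JSON string
toy_data.groupBy("id").agg(
    F.to_json(
        F.map_from_entries(
            F.collect_list(F.struct("key", "value")))).alias("key_value")
).show(truncate=False)
# e.g. |1  |{"a":"123","b":"234","c":"345"}|  (pair order depends on collection order)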

For PySpark < 2.4.0, where pyspark.sql.functions.map_from_entries is not available, you can use your own UDF instead:
import pyspark.sql.functions as F
from pyspark.sql.types import MapType, StringType
@F.udf(returnType=MapType(StringType(), StringType()))
def map_array(column):
    # turn the collected list of (key, value) structs into a Python dict
    return dict(column)

(toy_data.groupBy("id")
    .agg(F.collect_list(F.struct("key", "value")).alias("key_value"))
    .withColumn('key_value', map_array('key_value'))
    .show(truncate=False))
+---+------------------------------+
|id |key_value |
+---+------------------------------+
|1 |[a -> 123, b -> 234, c -> 345]|
|2 |[x -> 23, a -> 12, y -> 123] |
+---+------------------------------+

Related

PySpark - create multiple aggregative map columns without using UDF or join

I have a huge dataframe that looks similar to this:
+----+-------+-------+-----+
|name|level_A|level_B|hours|
+----+-------+-------+-----+
| Bob| 10| 3| 5|
| Bob| 10| 3| 15|
| Bob| 20| 3| 25|
| Sue| 30| 3| 35|
| Sue| 30| 7| 45|
+----+-------+-------+-----+
My desired output:
+----+--------------------+------------------+
|name| map_level_A| map_level_B|
+----+--------------------+------------------+
| Bob|{10 -> 20, 20 -> 25}| {3 -> 45}|
| Sue| {30 -> 80}|{7 -> 45, 3 -> 35}|
+----+--------------------+------------------+
Meaning, group by name, adding 2 MapType columns that map level_A and level_B to the sum of hours.
I know I can get that output using a UDF or a join operation.
However, in practice, the data is very big, and it's not 2 map columns, but rather tens of them, so join/UDF are just too costly.
Is there a more efficient way to do that?
You could consider using Window functions. You'll need a WindowSpec for each level_X, partitioned by both name and level_X, to calculate the sum of hours. Then group by name and create a map from the array of structs:
from pyspark.sql import Window
import pyspark.sql.functions as F
df = spark.createDataFrame(
    [("Bob", 10, 3, 5), ("Bob", 10, 3, 15), ("Bob", 20, 3, 25),
     ("Sue", 30, 3, 35), ("Sue", 30, 7, 45)],
    ["name", "level_A", "level_B", "hours"])
wla = Window.partitionBy("name", "level_A")
wlb = Window.partitionBy("name", "level_B")
result = df.withColumn("hours_A", F.sum("hours").over(wla)) \
    .withColumn("hours_B", F.sum("hours").over(wlb)) \
    .groupBy("name") \
    .agg(
        F.map_from_entries(
            F.collect_set(F.struct(F.col("level_A"), F.col("hours_A")))
        ).alias("map_level_A"),
        F.map_from_entries(
            F.collect_set(F.struct(F.col("level_B"), F.col("hours_B")))
        ).alias("map_level_B")
    )
result.show()
#+----+--------------------+------------------+
#|name| map_level_A| map_level_B|
#+----+--------------------+------------------+
#| Sue| {30 -> 80}|{3 -> 35, 7 -> 45}|
#| Bob|{10 -> 20, 20 -> 25}| {3 -> 45}|
#+----+--------------------+------------------+
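Since the question mentions tens of such columns, the same pattern can be generated in a loop. A sketch of that generalization (my own, assuming all the map columns follow the level_X pattern used above; level_cols is a hypothetical list of the column names):
import pyspark.sql.functions as F
from pyspark.sql import Window

level_cols = ["level_A", "level_B"]  # extend with the remaining level columns

tmp = df
aggs = []
for c in level_cols:
    # sum of hours per (name, level value), attached as an extra column
    w = Window.partitionBy("name", c)
    tmp = tmp.withColumn(f"hours_{c}", F.sum("hours").over(w))
    aggs.append(
        F.map_from_entries(
            F.collect_set(F.struct(F.col(c), F.col(f"hours_{c}")))
        ).alias(f"map_{c}")
    )

result_all = tmp.groupBy("name").agg(*aggs)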

How do I use multiple conditions with pyspark.sql.funtions.when() from a dict?

I want to generate a when clause based on values in a dict. It's very similar to what's being done in How do I use multiple conditions with pyspark.sql.funtions.when()?
Only here I want to pass a dict of columns and values.
Let's say I have a dict:
{
    'employed': 'Y',
    'athlete': 'N'
}
I want to use that dict to generate the equivalent of:
df.withColumn("call_person", when((col("employed") == "Y") & (col("athlete") == "N"), "Y"))
So the end result is:
+---+-----------+--------+-------+
| id|call_person|employed|athlete|
+---+-----------+--------+-------+
| 1| Y | Y | N |
| 2| N | Y | Y |
| 3| N | N | N |
+---+-----------+--------+-------+
Note: part of the reason I want to do this programmatically is that the dicts have different lengths (numbers of conditions).
Use the reduce() function:
from functools import reduce
from pyspark.sql.functions import when, col
# dictionary
d = {
    'employed': 'Y',
    'athlete': 'N'
}
# set up the conditions, multiple conditions merged with `&`
cond = reduce(lambda x, y: x & y, [col(c) == v for c, v in d.items() if c in df.columns])
# set up the new column
df.withColumn("call_person", when(cond, "Y").otherwise("N")).show()
+---+--------+-------+-----------+
| id|employed|athlete|call_person|
+---+--------+-------+-----------+
| 1| Y| N| Y|
| 2| Y| Y| N|
| 3| N| N| N|
+---+--------+-------+-----------+
You can also access dictionary items directly:
lookup = {
    'code': 'b',
    'amt': '4'
}
data = [(1, 'code'), (1, 'amt')]
df = spark.createDataFrame(data, ['id', 'dict_key'])
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# look up each key in the driver-side dictionary
user_func = udf(lambda x: lookup.get(x), StringType())
newdf = df.withColumn('new_column', user_func(df.dict_key))
>>> newdf.show();
+---+--------+----------+
| id|dict_key|new_column|
+---+--------+----------+
| 1| code| b|
| 1| amt| 4|
+---+--------+----------+
or by broadcasting the dictionary:
broadcast_dict = sc.broadcast(lookup)
def my_func(key):
    # read from the broadcast value on the executors
    return broadcast_dict.value.get(key)

new_my_func = udf(my_func, StringType())
newdf = df.withColumn('new_column', new_my_func(df.dict_key))
>>> newdf.show();
+---+--------+----------+
| id|dict_key|new_column|
+---+--------+----------+
| 1| code| b|
| 1| amt| 4|
+---+--------+----------+
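A UDF-free alternative (a sketch of my own, not from the original answers) is to build a literal map column from the dictionary with create_map and index it with the key column:
from itertools import chain
import pyspark.sql.functions as F

lookup = {'code': 'b', 'amt': '4'}

# build a MapType literal: map('code', 'b', 'amt', '4')
mapping = F.create_map(*[F.lit(x) for x in chain(*lookup.items())])

newdf = df.withColumn('new_column', mapping.getItem(F.col('dict_key')))
newdf.show()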

Spark struct represented by OneHotEncoder

I have a data frame with two columns,
+---+-------+
| id| fruit|
+---+-------+
| 0| apple|
| 1| banana|
| 2|coconut|
| 1| banana|
| 2|coconut|
+---+-------+
I also have a universal list with all the items:
fruitList: Seq[String] = WrappedArray(apple, coconut, banana)
Now I want to create a new column in the dataframe with an array of 1s and 0s, where 1 means the item is present for that row and 0 means it is not.
Desired Output
+---+-----------+
| id| fruitlist|
+---+-----------+
| 0| [1,0,0] |
| 1| [0,1,0] |
| 2|[0,0,1] |
| 1| [0,1,0] |
| 2|[0,0,1] |
+---+-----------+
This is something I tried,
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
val df = spark.createDataFrame(Seq(
  (0, "apple"),
  (1, "banana"),
  (2, "coconut"),
  (1, "banana"),
  (2, "coconut")
)).toDF("id", "fruit")
df.show
import org.apache.spark.sql.functions._
val fruitList = df.select(collect_set("fruit")).first().getAs[Seq[String]](0)
print(fruitList)
I tried to solve this with OneHotEncoder but the result was something like this after converting to dense vector, which is not what I needed.
+---+-------+----------+-------------+---------+
| id| fruit|fruitIndex| fruitVec| vd|
+---+-------+----------+-------------+---------+
| 0| apple| 2.0| (2,[],[])|[0.0,0.0]|
| 1| banana| 1.0|(2,[1],[1.0])|[0.0,1.0]|
| 2|coconut| 0.0|(2,[0],[1.0])|[1.0,0.0]|
| 1| banana| 1.0|(2,[1],[1.0])|[0.0,1.0]|
| 2|coconut| 0.0|(2,[0],[1.0])|[1.0,0.0]|
+---+-------+----------+-------------+---------+
If you have a collection such as
val fruitList: Seq[String] = Array("apple", "coconut", "banana")
then you can do it either with built-in functions or with a udf function.
Built-in functions (array, when and lit):
import org.apache.spark.sql.functions._
df.withColumn("fruitList", array(fruitList.map(x => when(lit(x) === col("fruit"),1).otherwise(0)): _*)).show(false)
udf function
import org.apache.spark.sql.functions._
def containedUdf = udf((fruit: String) => fruitList.map(x => if(x == fruit) 1 else 0))
df.withColumn("fruitList", containedUdf(col("fruit"))).show(false)
which should give you
+---+-------+---------+
|id |fruit |fruitList|
+---+-------+---------+
|0 |apple |[1, 0, 0]|
|1 |banana |[0, 0, 1]|
|2 |coconut|[0, 1, 0]|
|1 |banana |[0, 0, 1]|
|2 |coconut|[0, 1, 0]|
+---+-------+---------+
udf functions are easy to understand and straightforward when dealing with primitive datatypes, but they should be avoided when optimized, fast built-in functions are available to do the same task.
I hope the answer is helpful
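For reference, a PySpark version of the same built-in-functions approach (my own sketch, not part of the original answer):
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [(0, "apple"), (1, "banana"), (2, "coconut"), (1, "banana"), (2, "coconut")],
    ["id", "fruit"])
fruit_list = ["apple", "coconut", "banana"]

# one 0/1 flag per fruit, in the order of fruit_list
df.withColumn(
    "fruitList",
    F.array(*[F.when(F.col("fruit") == x, 1).otherwise(0) for x in fruit_list])
).show(truncate=False)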

structured streaming - explode json fields into dynamic columns?

I got this dataframe from a Kafka source.
+-----------------------+
| data |
+-----------------------+
| '{ "a": 1, "b": 2 }' |
+-----------------------+
| '{ "b": 3, "d": 4 }' |
+-----------------------+
| '{ "a": 2, "c": 4 }' |
+-----------------------+
I want to transform this into the following data frame:
+---------------------------+
| a | b | c | d |
+---------------------------+
| 1 | 2 | null | null |
+---------------------------+
| null | 3 | null | 4 |
+---------------------------+
| 2 | null | 4 | null |
+---------------------------+
The number of JSON fields may change, so I can't specify a schema for it.
I pretty much got the idea of how to do the transformation in Spark batch: use some map and reduce to get the set of JSON keys, then construct a new dataframe with withColumn.
However, as far as I've been exploring, there is no map/reduce function in Structured Streaming. How do I achieve this?
UPDATE
I figured out that a UDF can be used to parse the string into JSON fields:
import simplejson as json
from pyspark.sql.functions import udf
def convert_json(s):
    return json.loads(s)
udf_convert_json = udf(convert_json, StructType(<..some schema here..>))
df = df.withColumn('parsed_data', udf_convert_json(df.data))
However, since the schema is dynamic, I need to get all the JSON keys and values present in df.data over a certain window period in order to construct the StructType used as the UDF return type.
In the end, I guess I need to know how to perform a reduce over the dataset for a certain window period and then use the result as a lookup schema in the stream transformation.
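Building such a StructType from a collected list of keys is straightforward; a minimal sketch (my own illustration; keys is a hypothetical list of collected field names):
from pyspark.sql.types import StructType, StructField, StringType

keys = ["a", "b", "c", "d"]
schema = StructType([StructField(k, StringType(), True) for k in keys])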
If you already know all the unique keys in your JSON data, then you can use the json_tuple function:
>>> df.show()
+------------------+
| data|
+------------------+
|{ "a": 1, "b": 2 }|
|{ "b": 3, "d": 4 }|
|{ "a": 2, "c": 4 }|
+------------------+
>>> from pyspark.sql import functions as F
>>> df.select(F.json_tuple(df.data,'a','b','c','d')).show()
+----+----+----+----+
| c0| c1| c2| c3|
+----+----+----+----+
| 1| 2|null|null|
|null| 3|null| 4|
| 2|null| 4|null|
+----+----+----+----+
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField("a", StringType()),StructField("b", StringType()),StructField("c",StringType()),StructField("d", StringType())])
>>> df.select(F.from_json(df.data,schema).alias('data')).select(F.col('data.*')).show()
+----+----+----+----+
| a| b| c| d|
+----+----+----+----+
| 1| 2|null|null|
|null| 3|null| 4|
| 2|null| 4|null|
+----+----+----+----+
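If the keys are not known up front, one batch-mode option (my own sketch, not part of the original answer) is to collect the distinct keys first and then feed them to json_tuple:
import json
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('{ "a": 1, "b": 2 }',), ('{ "b": 3, "d": 4 }',), ('{ "a": 2, "c": 4 }',)],
    ["data"])

# collect the union of JSON keys across all rows (a batch action, not available on a streaming dataframe)
keys = sorted(df.rdd.flatMap(lambda row: json.loads(row.data).keys()).distinct().collect())

df.select(F.json_tuple(df.data, *keys)).toDF(*keys).show()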
When you have a dynamic JSON column inside your PySpark dataframe, you can use the code below to explode its fields into columns:
df2 = df.withColumn('columnx', udf_transform_tojsonstring(df.columnx))
columnx_jsonDF = spark.read.json(df2.rdd.map(lambda row: row.columnx)).drop('_corrupt_record')
df3 = df2.withColumn('columnx', from_json(col('columnx'),columnx_jsonDF.schema))
for c in set(columnx_jsonDF.columns):
    df3 = df3.withColumn(f'columnx_{c}', df3[f'columnx.`{c}`'])
Explanation:
First we use a UDF to transform our column into a valid JSON string (if it isn't one already).
Next we read our column as a JSON dataframe (with an inferred schema).
Then we read columnx again with the from_json() function, passing the columnx_jsonDF schema to it.
Finally we add a column to the main dataframe for each key inside our JSON column.
This works when we don't know the JSON fields in advance and yet need to explode its columns.
I guess you don't need to do much. For example:
Bala:~:$ cat myjson.json
{ "a": 1, "b": 2 }
{ "b": 3, "d": 4 }
{ "a": 2, "c": 4 }
>>> df = sqlContext.sql("select * from json.`/Users/Bala/myjson.json`")
>>> df.show()
+----+----+----+----+
| a| b| c| d|
+----+----+----+----+
| 1| 2|null|null|
|null| 3|null| 4|
| 2|null| 4|null|
+----+----+----+----+

Spark : How to group by distinct values in DataFrame

I have a data in a file in the following format:
1,32
1,33
1,44
2,21
2,56
1,23
The code I am executing is the following:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import spark.implicits._
import sqlContext.implicits._
case class Person(a: Int, b: Int)
val ppl = sc.textFile("newfile.txt").map(_.split(","))
.map(p=> Person(p(0).trim.toInt, p(1).trim.toInt))
.toDF()
ppl.registerTempTable("people")
val result = ppl.select("a","b").groupBy('a).agg()
result.show
The expected output is:
1 -> 32, 33, 44, 23
2 -> 21, 56
Instead of aggregating by sum, count, mean, etc., I want every element of the group.
Try the collect_set function inside agg():
val df = sc.parallelize(Seq(
  (1, 3), (1, 6), (1, 5), (2, 1), (2, 4),
  (2, 1))).toDF("a", "b")
+---+---+
| a| b|
+---+---+
| 1| 3|
| 1| 6|
| 1| 5|
| 2| 1|
| 2| 4|
| 2| 1|
+---+---+
import org.apache.spark.sql.functions.{collect_set, collect_list}
df.groupBy("a").agg(collect_set("b")).show()
+---+--------------+
| a|collect_set(b)|
+---+--------------+
| 1| [3, 6, 5]|
| 2| [1, 4]|
+---+--------------+
And if you want duplicate entries, you can use collect_list:
df.groupBy("a").agg(collect_list("b")).show()
+---+---------------+
| a|collect_list(b)|
+---+---------------+
| 1| [3, 6, 5]|
| 2| [1, 4, 1]|
+---+---------------+
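If the goal is the comma-separated string shown in the expected output rather than an array column, you could wrap the collected values in concat_ws. A PySpark sketch of the idea (my own, not from the original answer; the cast is needed because concat_ws expects strings):
import pyspark.sql.functions as F

df = spark.createDataFrame([(1, 3), (1, 6), (1, 5), (2, 1), (2, 4), (2, 1)], ["a", "b"])

# distinct values per group, joined into one string (element order is not guaranteed)
df.groupBy("a").agg(
    F.concat_ws(", ", F.collect_set(F.col("b").cast("string"))).alias("b_values")
).show()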
