Pandas UDF with dictionary lookup and conditionals - apache-spark

I want to use pandas_udf in PySpark for certain transformations and calculations on a column, and it seems that a pandas UDF can't be written exactly like a normal UDF.
An example function looks something like below:
def modify_some_column(example_column_1, example_column_2):
    lookup_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}  # can be anything
    if example_column_1 in lookup_dict:
        if example_column_1 == 'a' and example_column_2 == "something":
            return lookup_dict[example_column_1]
        elif example_column_1 == 'a' and example_column_2 == "something else":
            return "something else"
        else:
            return lookup_dict[example_column_1]
    else:
        return ""
Basically, it takes in two column values from a Spark dataframe and returns a value, which I intend to use with withColumn:
modify_some_column_udf = pandas_udf(modify_some_column, returnType=StringType())
df = df.withColumn('new_col', modify_some_column_udf(df.col_1, df.col_2))
But this does not work. How should I modify the above to be able to use it as a pandas UDF?
Edit: It is clear to me that the above conditions can easily and efficiently be implemented using native PySpark functions. But I am looking to write the above logic using a Pandas UDF.

With this simple if/else logic, you don't have to use a UDF. In fact, you should avoid UDFs as much as possible.
Assuming you have the following dataframe:
df = spark.createDataFrame([
    ('a', 'something'),
    ('a', 'something else'),
    ('c', None),
    ('c', ''),
    ('c', 'something'),
    ('c', 'something else'),
    ('c', 'blah'),
    ('f', 'blah'),
], ['c1', 'c2'])
df.show()
+---+--------------+
| c1| c2|
+---+--------------+
| a| something|
| a|something else|
| c| null|
| c| |
| c| something|
| c|something else|
| c| blah|
| f| blah|
+---+--------------+
You can create a temporary lookup column and use it to check against other columns
import json
import pyspark.sql.functions as F

your_lookup_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

(df
    .withColumn('lookup', F.from_json(F.lit(json.dumps(your_lookup_dict)), 'map<string, string>'))
    .withColumn('mod', F
        .when((F.col('c1') == 'a') & (F.col('c2') == 'something'), F.col('lookup')[F.col('c1')])
        .when((F.col('c1') == 'a') & (F.col('c2') == 'something else'), F.lit('something else'))
        .otherwise(F.col('lookup')[F.col('c1')])
    )
    .show(10, False)
)
+---+--------------+----------------------------------------+--------------+
|c1 |c2 |lookup |mod |
+---+--------------+----------------------------------------+--------------+
|a |something |{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|1 |
|a |something else|{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|something else|
|c |null |{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|3 |
|c | |{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|3 |
|c |something |{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|3 |
|c |something else|{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|3 |
|c |blah |{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|3 |
|f |blah |{a -> 1, b -> 2, c -> 3, d -> 4, e -> 5}|null |
+---+--------------+----------------------------------------+--------------+
EDIT
Since you insist on using a Pandas UDF, you have to understand that Spark feeds your dataframe to Pandas in batches, so you'll have to wrap your function in something like this:
def wrapper(iterator):
    def modify_some_column(example_column_1, example_column_2):
        lookup_dict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}  # can be anything
        if example_column_1 in lookup_dict:
            if example_column_1 == 'a' and example_column_2 == "something":
                return str(lookup_dict[example_column_1])
            elif example_column_1 == 'a' and example_column_2 == "something else":
                return "something else"
            else:
                return str(lookup_dict[example_column_1])
        else:
            return ""
    for pdf in iterator:
        pdf['mod'] = pdf.apply(lambda r: modify_some_column(r['c1'], r['c2']), axis=1)
        yield pdf

# add a placeholder 'mod' column so that df.schema (passed to mapInPandas) includes the new column
df = df.withColumn('mod', F.lit('temp'))
df.mapInPandas(wrapper, df.schema).show()
+---+--------------+--------------+
| c1| c2| mod|
+---+--------------+--------------+
| a| something| 1|
| a|something else|something else|
| c| null| 3|
| c| | 3|
| c| something| 3|
| c|something else| 3|
| c| blah| 3|
| f| blah| |
+---+--------------+--------------+
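For reference, the same logic can also be expressed as a scalar Pandas UDF (Series in, Series out) rather than mapInPandas. This is a minimal sketch, assuming Spark 3.x type-hinted Pandas UDFs and the c1/c2 columns of the dataframe above:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def modify_some_column_udf(c1: pd.Series, c2: pd.Series) -> pd.Series:
    # keys mapped to their values as strings; unknown keys become NaN
    lookup = {'a': '1', 'b': '2', 'c': '3', 'd': '4', 'e': '5'}
    mapped = c1.map(lookup)
    # special case: key 'a' paired with "something else" returns the literal string
    special = (c1 == 'a') & (c2 == 'something else')
    # unknown keys fall back to an empty string
    return mapped.mask(special, 'something else').fillna('')

df.withColumn('mod', modify_some_column_udf('c1', 'c2')).show()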

Related

Collect Column names in Spark without using case when

I have a dataframe with the following columns (A, A_1, B, B_1, C, C_1, D, D_1, E, E_1).
For example, if A = A_1 and B <> B_1 and C <> C_1 and D <> D_1 and E <> E_1, then I want to create a column and mark it as 'B,C,D,E' for these records. Similarly, all other combinations are possible.
I tried writing case when, but there are a lot of combinations.
I'm unsure if it'd be possible without looping over the columns. So here's an easy approach that loops over your columns and checks whether each column pair is unequal; if so, it outputs the column name. We can then concatenate the column names that we get from the loop.
Example below:
import pyspark.sql.functions as func

data_ls = [
    (1, 1, 1, 3, 4, 4, 5, 6),
    (1, 0, 1, 3, 4, 4, 5, 6)
]

data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['a', 'a_1', 'b', 'b_1', 'c', 'c_1', 'd', 'd_1'])
# +---+---+---+---+---+---+---+---+
# | a|a_1| b|b_1| c|c_1| d|d_1|
# +---+---+---+---+---+---+---+---+
# | 1| 1| 1| 3| 4| 4| 5| 6|
# | 1| 0| 1| 3| 4| 4| 5| 6|
# +---+---+---+---+---+---+---+---+
# create condition statements for each column pair in a list
col_conditions = [func.when(func.col(k) != func.col(k+'_1'), func.lit(k)) for k in data_sdf.columns if not k.endswith('_1')]
# [Column<'CASE WHEN (NOT (a = a_1)) THEN a END'>,
# Column<'CASE WHEN (NOT (b = b_1)) THEN b END'>,
# Column<'CASE WHEN (NOT (c = c_1)) THEN c END'>,
# Column<'CASE WHEN (NOT (d = d_1)) THEN d END'>]
# concatenate the case when statements with concat_ws
data_sdf. \
    withColumn('ineq_cols', func.concat_ws(',', *col_conditions)). \
    show(truncate=False)
# +---+---+---+---+---+---+---+---+---------+
# |a |a_1|b |b_1|c |c_1|d |d_1|ineq_cols|
# +---+---+---+---+---+---+---+---+---------+
# |1 |1 |1 |3 |4 |4 |5 |6 |b,d |
# |1 |0 |1 |3 |4 |4 |5 |6 |a,b,d |
# +---+---+---+---+---+---+---+---+---------+
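One caveat worth noting (not part of the original answer): the plain != comparison evaluates to null when either side is null, so such pairs would not be flagged. If a null on one side should count as a difference, a null-safe variant could look like this sketch, reusing the same data_sdf as above:
# null-safe version: eqNullSafe treats null == null as true, so negating it
# also flags pairs where exactly one side is null
col_conditions_ns = [
    func.when(~func.col(k).eqNullSafe(func.col(k + '_1')), func.lit(k))
    for k in data_sdf.columns if not k.endswith('_1')
]

data_sdf. \
    withColumn('ineq_cols', func.concat_ws(',', *col_conditions_ns)). \
    show(truncate=False)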

How do I use lag function to get my desired df from my source df?

I have a source dataframe that looks like this -
| Id | Offset | a    | b    | c    | d    | e    | f    |
| p  | 1      | 1    | 2    | null | null | null | null |
| p  | 2      | null | null | 3    | 4    | null | null |
| q  | 1      | 1    | 2    | null | null | null | null |
| q  | 2      | null | null | 3    | 4    | null | null |
| q  | 3      | null | null | null | null | 5    | 6    |
You can think of the columns (a-f) to be some features that describe some object (named Id), and these features get updated over time (the offsets). Not all of these features will be updated at the same time. This data is essentially my first df. From this df though, I need to get something like the second df, that essentially describes my objects with all the feature data available at that point of time.
I need my output df to be like this -
| Id | Offset | a | b | c | d | e | f |
| p  | 1      | 1 | 2 | n | n | n | n |
| p  | 2      | 1 | 2 | 3 | 4 | n | n |
| q  | 1      | 1 | 2 | n | n | n | n |
| q  | 2      | 1 | 2 | 3 | 4 | n | n |
| q  | 3      | 1 | 2 | 3 | 4 | 5 | 6 |
How can I achieve this with the lag function (or something else) in PySpark?
Based on the input, the expected output and your explanation, I assume you want to fill, for all rows following a row with a non-null value, the value contained in that non-null row.
To do this, you can apply the last aggregate, ignoring nulls, over a window spanning all rows between the start of the partition and the current row.
from pyspark.sql import functions as F
from pyspark.sql import Window as W

data = [("p", 1, 1, 2, None, None, None, None,),
        ("p", 2, None, None, 3, 4, None, None,),
        ("q", 1, 1, 2, None, None, None, None,),
        ("q", 2, None, None, 3, 4, None, None,),
        ("q", 3, None, None, None, None, 5, 6,), ]

df = spark.createDataFrame(data, ("Id", "Offset", "a", "b", "c", "d", "e", "f",))

window_spec = W.partitionBy("Id").orderBy(F.asc("Offset")).rowsBetween(W.unboundedPreceding, W.currentRow)

features_to_transform = ["a", "b", "c", "d", "e", "f"]

transformations = [F.last(feature, ignorenulls=True).over(window_spec).alias(feature)
                   for feature in features_to_transform]

df.select("Id", "Offset", *transformations).show()
Output
+---+------+---+---+----+----+----+----+
| Id|Offset| a| b| c| d| e| f|
+---+------+---+---+----+----+----+----+
| p| 1| 1| 2|null|null|null|null|
| p| 2| 1| 2| 3| 4|null|null|
| q| 1| 1| 2|null|null|null|null|
| q| 2| 1| 2| 3| 4|null|null|
| q| 3| 1| 2| 3| 4| 5| 6|
+---+------+---+---+----+----+----+----+
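If the feature list shouldn't be hard-coded, it can be derived from the schema instead. A small sketch, reusing the df and window_spec defined above:
# treat every column except the key columns as a feature to forward-fill
features_to_transform = [c for c in df.columns if c not in ("Id", "Offset")]

transformations = [F.last(c, ignorenulls=True).over(window_spec).alias(c)
                   for c in features_to_transform]

df.select("Id", "Offset", *transformations).show()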

pyspark counting number of nulls per group

I have a dataframe that has time series data in it and some categorical data
| cat | TS1 | TS2 | ... |
| A | 1 | null | ... |
| A | 1 | 20 | ... |
| B | null | null | ... |
| A | null | null | ... |
| B | 1 | 100 | ... |
I would like to find out how many null values there are per column per group, so an expected output would look something like:
| cat | TS1 | TS2 |
| A | 1 | 2 |
| B | 1 | 1 |
Currently I can do this for one of the groups with something like this:
df_null_cats = df.where(df.cat == "A").where(reduce(lambda x, y: x | y, (col(c).isNull() for c in df.columns))).select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_nulls.columns])
but I am struggling to get one that would work for the whole dataframe.
You can use groupBy and an aggregation function to get the required output.
from pyspark.sql import *
from pyspark.sql.functions import *

spark = SparkSession.builder.master("local").getOrCreate()

# Sample dataframe
in_values = [("A", 1, None),
             ("A", 1, 20),
             ("B", None, None),
             ("A", None, None),
             ("B", 1, 100)]

in_df = spark.createDataFrame(in_values, "cat string, TS1 int, TS2 int")

# Ignoring the groupBy column and considering only the cols which are required in the aggregation
columns = in_df.columns
columns.remove("cat")

agg_expression = [sum(when(in_df[x].isNull(), 1).otherwise(0)).alias(x) for x in columns]

in_df.groupby("cat").agg(*agg_expression).show()
+---+---+---+
|cat|TS1|TS2|
+---+---+---+
| B| 1| 1|
| A| 1| 2|
+---+---+---+
"Sum" function can be used with condition for null value. On Scala:
val df = Seq(
  (Some("A"), Some(1), None),
  (Some("A"), Some(1), Some(20)),
  (Some("B"), None, None),
  (Some("A"), None, None),
  (Some("B"), Some(1), Some(100))
).toDF("cat", "TS1", "TS2")

val aggregatorColumns = df
  .columns
  .tail
  .map(columnName => sum(when(col(columnName).isNull, 1).otherwise(0)).alias(columnName))

df
  .groupBy("cat")
  .agg(aggregatorColumns.head, aggregatorColumns.tail: _*)
@Mohana's answer is good. As an alternative, in my answer below we can use Pandas UDFs and applyInPandas to write a simple function in Pandas, which will then be applied to our PySpark dataframe.
import pandas as pd
from pyspark.sql.types import *

in_values = [("A", 1, None),
             ("A", 1, 20),
             ("B", None, None),
             ("A", None, None),
             ("B", 1, 100)]

df = spark.createDataFrame(in_values, "cat string, TS1 int, TS2 int")

# define output schema: same column names, but we must ensure that the output type is integer
output_schema = StructType(
    [StructField('cat', StringType())] +
    [StructField(col, IntegerType(), True) for col in [c for c in df.columns if c.startswith('TS')]]
)

# custom Python function to define aggregations in Pandas
def null_count(pdf):
    columns = [c for c in pdf.columns if c.startswith('TS')]
    result = pdf \
        .groupby('cat')[columns] \
        .agg(lambda x: x.isnull().sum()) \
        .reset_index()
    return result

# use applyInPandas
df \
    .groupby('cat') \
    .applyInPandas(null_count, output_schema) \
    .show()
+---+---+---+
|cat|TS1|TS2|
+---+---+---+
| A| 1| 2|
| B| 1| 1|
+---+---+---+

Pyspark create dictionary within groupby

Is it possible in PySpark to create a dictionary within groupBy.agg()? Here is a toy example:
import pyspark
from pyspark.sql import Row
import pyspark.sql.functions as F

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

toy_data = spark.createDataFrame([
    Row(id=1, key='a', value="123"),
    Row(id=1, key='b', value="234"),
    Row(id=1, key='c', value="345"),
    Row(id=2, key='a', value="12"),
    Row(id=2, key='x', value="23"),
    Row(id=2, key='y', value="123")])
toy_data.show()
+---+---+-----+
| id|key|value|
+---+---+-----+
| 1| a| 123|
| 1| b| 234|
| 1| c| 345|
| 2| a| 12|
| 2| x| 23|
| 2| y| 123|
+---+---+-----+
and this is the expected output:
---+------------------------------------
id | key_value
---+------------------------------------
1 | {"a": "123", "b": "234", "c": "345"}
2 | {"a": "12", "x": "23", "y": "123"}
---+------------------------------------
I tried this but it doesn't work:
toy_data.groupBy("id").agg(
    F.create_map(col("key"), col("value")).alias("key_value")
)
This yields the following error:
AnalysisException: u"expression '`key`' is neither present in the group by, nor is it an aggregate function....
The agg component has to contain an actual aggregation function. One way to approach this is to combine collect_list (aggregate function: returns a list of objects with duplicates), struct (creates a new struct column) and map_from_entries (collection function: returns a map created from the given array of entries).
This is how you'd do that:
toy_data.groupBy("id").agg(
    F.map_from_entries(
        F.collect_list(
            F.struct("key", "value"))).alias("key_value")
).show(truncate=False)
+---+------------------------------+
|id |key_value |
+---+------------------------------+
|1 |[a -> 123, b -> 234, c -> 345]|
|2 |[a -> 12, x -> 23, y -> 123] |
+---+------------------------------+
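If you want key_value rendered as a JSON string, exactly like in the expected output, one option (a sketch; it assumes Spark >= 2.4 since it relies on map_from_entries) is to wrap the map in to_json:
toy_data.groupBy("id").agg(
    F.to_json(
        F.map_from_entries(
            F.collect_list(F.struct("key", "value")))).alias("key_value")
).show(truncate=False)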
For PySpark < 2.4.0, where pyspark.sql.functions.map_from_entries is not available, you can use your own UDF:
import pyspark.sql.functions as F
from pyspark.sql.types import MapType, StringType

@F.udf(returnType=MapType(StringType(), StringType()))
def map_array(column):
    return dict(column)

(toy_data.groupBy("id")
    .agg(F.collect_list(F.struct("key", "value")).alias("key_value"))
    .withColumn('key_value', map_array('key_value'))
    .show(truncate=False))
+---+------------------------------+
|id |key_value |
+---+------------------------------+
|1 |[a -> 123, b -> 234, c -> 345]|
|2 |[x -> 23, a -> 12, y -> 123] |
+---+------------------------------+

How to use groupBy to collect rows into a map?

Context
sqlContext.sql(s"""
  SELECT
    school_name,
    name,
    age
  FROM my_table
""")
Ask
Given the above table, I would like to group by school name and collect name, age into a Map[String, Int]
For example - Pseudo-code
val df = sqlContext.sql(s"""
  SELECT
    school_name,
    age
  FROM my_table
  GROUP BY school_name
""")
------------------------
school_name | name | age
------------------------
school A | "michael"| 7
school A | "emily" | 5
school B | "cathy" | 10
school B | "shaun" | 5
df.groupBy("school_name").agg(make_map)
------------------------------------
school_name | map
------------------------------------
school A | {"michael": 7, "emily": 5}
school B | {"cathy": 10, "shaun": 5}
The following will work with Spark 2.0. You can use the map function, available since the 2.0 release, to get columns as a Map.
val df1 = df.groupBy(col("school_name")).agg(collect_list(map($"name",$"age")) as "map")
df1.show(false)
This will give you below output.
+-----------+------------------------------------+
|school_name|map |
+-----------+------------------------------------+
|school B |[Map(cathy -> 10), Map(shaun -> 5)] |
|school A |[Map(michael -> 7), Map(emily -> 5)]|
+-----------+------------------------------------+
Now you can use a UDF to join the individual Maps into a single Map, like below.
import org.apache.spark.sql.functions.udf
val joinMap = udf { values: Seq[Map[String,Int]] => values.flatten.toMap }
val df2 = df1.withColumn("map", joinMap(col("map")))
df2.show(false)
This will give required output with Map[String,Int].
+-----------+-----------------------------+
|school_name|map |
+-----------+-----------------------------+
|school B |Map(cathy -> 10, shaun -> 5) |
|school A |Map(michael -> 7, emily -> 5)|
+-----------+-----------------------------+
If you want to convert a column value into a JSON string, Spark 2.1.0 introduced the to_json function.
val df3 = df2.withColumn("map",to_json(struct($"map")))
df3.show(false)
The to_json function will return following output.
+-----------+-------------------------------+
|school_name|map |
+-----------+-------------------------------+
|school B |{"map":{"cathy":10,"shaun":5}} |
|school A |{"map":{"michael":7,"emily":5}}|
+-----------+-------------------------------+
As of Spark 2.4 you can use the map_from_arrays function to achieve this.
val df = spark.sql(s"""
  SELECT *
  FROM VALUES ('s1','a',1),('s1','b',2),('s2','a',1)
  AS (school, name, age)
""")

val df2 = df.groupBy("school").agg(map_from_arrays(collect_list($"name"), collect_list($"age")).as("map"))
+------+----+---+
|school|name|age|
+------+----+---+
| s1| a| 1|
| s1| b| 2|
| s2| a| 1|
+------+----+---+
+------+----------------+
|school| map|
+------+----------------+
| s2| [a -> 1]|
| s1|[a -> 1, b -> 2]|
+------+----------------+
df.select($"school_name",concat_ws(":",$"age",$"name").as("new_col")).groupBy($"school_name").agg(collect_set($"new_col")).show
+-----------+--------------------+
|school_name|collect_set(new_col)|
+-----------+--------------------+
| school B| [5:shaun, 10:cathy]|
| school A|[7:michael, 5:emily]|
+-----------+--------------------+
