Pyspark: How to remove an item from a collect_set?

In the following dataframe:
from pyspark.sql import functions as F
df = sqlContext.createDataFrame([
    ("a", "code1", "name"),
    ("a", "code1", "name2"),
    ("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
You can run this command to get a list of distinct values:
df.groupby("id").agg(F.collect_set("code")).show()
+---+-----------------+
| id|collect_set(code)|
+---+-----------------+
| a| [code2, code1]|
+---+-----------------+
How do you remove an item from the above collect_set? For example, how would you remove 'code2'?

Update for Spark 2.4+: You can achieve this with array_remove:
df_grouped = df.groupby("id")\
.agg(F.array_remove(F.collect_set("code"), "code2").alias("codes"))
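For the sample dataframe this should give something like:
#+---+-------+
#| id|  codes|
#+---+-------+
#|  a|[code1]|
#+---+-------+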
Original answer for Spark 2.3 and below
AFAIK there is no way to dynamically iterate over an ArrayType(), so if your data is already in an array you have two options:
Option 1: Explode, Filter, Collect
Use pyspark.sql.functions.explode() to turn the elements of the array into separate rows. Then use pyspark.sql.DataFrame.where() to filter out the desired values. Finally do a groupBy() and collect_set() to gather the data back into one row.
df_grouped = df.groupby("id").agg(F.collect_set("code").alias("codes"))
df_grouped.select("*", F.explode("codes").alias("exploded"))\
.where(~F.col("exploded").isin(["code2"]))\
.groupBy("id")\
.agg(F.collect_set("exploded").alias("codes"))\
.show()
#+---+-------+
#| id| codes|
#+---+-------+
#| a|[code1]|
#+---+-------+
Option 2: Use a UDF
from pyspark.sql.types import ArrayType, StringType

def filter_code(array):
    bad_values = {"code2"}
    return [x for x in array if x not in bad_values]

filter_code_udf = F.udf(lambda x: filter_code(x), ArrayType(StringType()))
df_grouped = df.groupby("id").agg(F.collect_set("code").alias("codes"))
df_grouped.withColumn("codes_filtered", filter_code_udf("codes")).show()
#+---+--------------+--------------+
#| id| codes|codes_filtered|
#+---+--------------+--------------+
#| a|[code2, code1]| [code1]|
#+---+--------------+--------------+
Of course, if you are starting from your original dataframe (before the groupBy() and collect_set()) you can just filter the desired values first:
df.where(~F.col("code").isin(["code2"])).groupby("id").agg(F.collect_set("code")).show()
#+---+-----------------+
#| id|collect_set(code)|
#+---+-----------------+
#| a| [code1]|
#+---+-----------------+
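Side note: on Spark 2.4+ you can also filter the array in place with the filter higher-order function, which is handy when there are several values to drop (a minimal sketch, reusing the bad_values idea from Option 2):
df_grouped = df.groupby("id").agg(F.collect_set("code").alias("codes"))
df_grouped = df_grouped.withColumn(
    "codes",
    F.expr("filter(codes, x -> NOT array_contains(array('code2'), x))")
)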

Related

Merging two or more dataframes/rdd efficiently in PySpark

I'm trying to merge three RDD's based on the same key. The following is the data.
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 15|
| 3| Candy| 15|
| 1| Bahroze| 15|
+------+---------+-----+
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 7342|
| 3| Candy| 5669|
| 1| Bahroze| 8361|
+------+---------+-----+
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 37|
| 3| Candy| 27|
| 1| Bahroze| 39|
+------+---------+-----+
I'm able to merge these three DFs. I converted each of them to an RDD of dicts with the following code:
new_rdd = userTotalVisits.rdd.map(lambda row: row.asDict(True))
After the RDD conversion, I keep one RDD and collect the other two as lists, then map over the first RDD and add keys to each row that match on UserID. I was hoping there was a better way of doing this using pyspark. Here's the code I've written.
def transform(row):
    # Add a new key to each row
    for x in conversion_list:  # first rdd collected as a list of dicts
        if x['UserID'] == row['UserID']:
            row["Total"] = {"Visitors": row["Total"], "Conversions": x["Total"]}
    for y in Revenue_list:  # second rdd collected as a list of dicts
        if y['UserID'] == row['UserID']:
            row["Total"]["Revenue"] = y["Total"]
    return row

potato = new_rdd.map(lambda row: transform(row))  # first rdd
How should I efficiently merge these three RDDs/DFs? (I have to perform three different tasks on a huge DF.) I'm looking for a more efficient approach. PS: I'm still a Spark newbie. The result of my code is as follows, which is what I need:
{'UserID': '2', 'UserLabel': 'Panda', 'Total': {'Visitors': 37, 'Conversions': 15, 'Revenue': 7342}}
{'UserID': '3', 'UserLabel': 'Candy', 'Total': {'Visitors': 27, 'Conversions': 15, 'Revenue': 5669}}
{'UserID': '1', 'UserLabel': 'Bahroze', 'Total': {'Visitors': 39, 'Conversions': 15, 'Revenue': 8361}}
Thank you.
You can join the 3 dataframes on columns ["UserID", "UserLabel"] and create a new struct Total from the 3 Total columns:
from pyspark.sql import functions as F
result = df1.alias("conv") \
.join(df2.alias("rev"), ["UserID", "UserLabel"], "left") \
.join(df3.alias("visit"), ["UserID", "UserLabel"], "left") \
.select(
F.col("UserID"),
F.col("UserLabel"),
F.struct(
F.col("conv.Total").alias("Conversions"),
F.col("rev.Total").alias("Revenue"),
F.col("visit.Total").alias("Visitors")
).alias("Total")
)
# write into json file
result.write.json("output")
# print result:
for i in result.toJSON().collect():
    print(i)
# {"UserID":3,"UserLabel":"Candy","Total":{"Conversions":15,"Revenue":5669,"Visitors":27}}
# {"UserID":1,"UserLabel":"Bahroze","Total":{"Conversions":15,"Revenue":8361,"Visitors":39}}
# {"UserID":2,"UserLabel":"Panda","Total":{"Conversions":15,"Revenue":7342,"Visitors":37}}
You can just do left joins on all three dataframes, but make sure the first dataframe you use has all the UserID and UserLabel values. You can ignore the GroupBy operation, as suggested by @blackbishop, and it will still give you the required output.
I am showing how it can be done in Scala, but you could do something similar in Python (see the PySpark sketch below).
//source data
val visitorDF = Seq((2,"Panda",15),(3,"Candy",15),(1,"Bahroze",15),(4,"Test",25)).toDF("UserID","UserLabel","Total")
val conversionsDF = Seq((2,"Panda",37),(3,"Candy",27),(1,"Bahroze",39)).toDF("UserID","UserLabel","Total")
val revenueDF = Seq((2,"Panda",7342),(3,"Candy",5669),(1,"Bahroze",8361)).toDF("UserID","UserLabel","Total")
import org.apache.spark.sql.functions._
val finalDF = visitorDF.as("v").join(conversionsDF.as("c"),Seq("UserID","UserLabel"),"left")
.join(revenueDF.as("r"),Seq("UserID","UserLabel"),"left")
.withColumn("TotalArray",struct($"v.Total".as("Visitor"),$"c.Total".as("Conversions"),$"r.Total".as("Revenue")))
.drop("Total")
display(finalDF)
The output is finalDF with UserID, UserLabel and the new TotalArray struct column (screenshot omitted).
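A rough PySpark translation of the Scala above (untested sketch; visitor_df, conversions_df and revenue_df are placeholder names for your own dataframes):
from pyspark.sql import functions as F

final_df = (visitor_df.alias("v")
    .join(conversions_df.alias("c"), ["UserID", "UserLabel"], "left")
    .join(revenue_df.alias("r"), ["UserID", "UserLabel"], "left")
    .withColumn("TotalArray", F.struct(
        F.col("v.Total").alias("Visitor"),
        F.col("c.Total").alias("Conversions"),
        F.col("r.Total").alias("Revenue")))
    .drop("Total"))
final_df.show(truncate=False)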

spark dynamically create struct/json per group

I have a spark dataframe like
+-----+---+---+---+------+
|group| a| b| c|config|
+-----+---+---+---+------+
| a| 1| 2| 3| [a]|
| b| 2| 3| 4|[a, b]|
+-----+---+---+---+------+
val df = Seq(("a", 1, 2, 3, Seq("a")),("b", 2, 3,4, Seq("a", "b"))).toDF("group", "a", "b","c", "config")
How can I add an additional column i.e.
df.withColumn("select_by_config", <<>>).show
as a struct or JSON which combines a number of columns (specified by config) in something similar to a hive named struct / spark struct / json column? Note, this struct is specific per group and not constant for the whole dataframe; it is specified in config column.
I can imagine that a df.map could do the trick, but the serialization overhead does not seem to be efficient. How can this be achieved via SQL only expressions? Maybe as a Map-type column?
edit
a possible but really clumsy solution for 2.2 is:
val df = Seq((1,"a", 1, 2, 3, Seq("a")),(2, "b", 2, 3,4, Seq("a", "b"))).toDF("id", "group", "a", "b","c", "config")
df.show
import spark.implicits._
final case class Foo(id:Int, c1:Int, specific:Map[String, Int])
df.map(r => {
val config = r.getAs[Seq[String]]("config")
print(config)
val others = config.map(elem => (elem, r.getAs[Int](elem))).toMap
Foo(r.getAs[Int]("id"), r.getAs[Int]("c"), others)
}).show
are there any better ways to solve the problem for 2.2?
If you use a recent build (Spark 2.4.0 RC 1 or later) a combination of higher order functions should do the trick. Create a map of columns:
import org.apache.spark.sql.functions.{
  array, col, expr, lit, map_from_arrays, map_from_entries
}

val cols = Seq("a", "b", "c")

val dfm = df.withColumn(
  "cmap",
  map_from_arrays(array(cols map lit: _*), array(cols map col: _*))
)
and transform the config:
dfm.withColumn(
  "config_mapped",
  map_from_entries(expr("transform(config, k -> struct(k, cmap[k]))"))
).show
// +-----+---+---+---+------+--------------------+----------------+
// |group| a| b| c|config| cmap| config_mapped|
// +-----+---+---+---+------+--------------------+----------------+
// | a| 1| 2| 3| [a]|[a -> 1, b -> 2, ...| [a -> 1]|
// | b| 2| 3| 4|[a, b]|[a -> 2, b -> 3, ...|[a -> 2, b -> 3]|
// +-----+---+---+---+------+--------------------+----------------+
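For reference, the same approach in PySpark 2.4+ should look roughly like this (a sketch using the same column names as above):
from pyspark.sql import functions as F

cols = ["a", "b", "c"]

# build a column-name -> column-value map per row
dfm = df.withColumn(
    "cmap",
    F.map_from_arrays(F.array(*[F.lit(c) for c in cols]),
                      F.array(*[F.col(c) for c in cols]))
)

# keep only the entries whose keys appear in the config array
dfm.withColumn(
    "config_mapped",
    F.map_from_entries(F.expr("transform(config, k -> struct(k, cmap[k]))"))
).show()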

How to change case of whole pyspark dataframe to lower or upper

I am trying to apply the pyspark sql functions hash algorithm to every row of two dataframes to identify the differences. The hash algorithm is case sensitive, i.e. 'APPLE' and 'Apple' are considered two different values, so I want to change the case of both dataframes to either upper or lower. I am able to achieve this only for the dataframe headers, but not for the dataframe values. Please help.
#Code for Dataframe column headers
self.df_db1 =self.df_db1.toDF(*[c.lower() for c in self.df_db1.columns])
Assuming df is your dataframe, this should do the work:
from pyspark.sql import functions as F
for col in df.columns:
    df = df.withColumn(col, F.lower(F.col(col)))
Both answers seem to be OK, with one exception: if you have a numeric column, it will be converted to a string column. To avoid this, try:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val fields = df.schema.fields
val stringFields = df.schema.fields.filter(f => f.dataType == StringType)
val nonStringFields = df.schema.fields.filter(f => f.dataType != StringType).map(f => f.name).map(f => col(f))
val stringFieldsTransformed = stringFields.map(f => f.name).map(f => upper(col(f)).as(f))
val df = sourceDF.select(stringFieldsTransformed ++ nonStringFields: _*)
Now the types are also correct when you have non-string fields (i.e. numeric fields).
If you know that each column is of String type, use one of the other answers - they are correct in that case :)
Python code in PySpark:
from pyspark.sql.functions import *
from pyspark.sql.types import *
sourceDF = spark.createDataFrame([(1, "a")], ['n', 'n1'])
fields = sourceDF.schema.fields
stringFields = filter(lambda f: isinstance(f.dataType, StringType), fields)
nonStringFields = map(lambda f: col(f.name), filter(lambda f: not isinstance(f.dataType, StringType), fields))
stringFieldsTransformed = map(lambda f: upper(col(f.name)), stringFields)
allFields = [*stringFieldsTransformed, *nonStringFields]
df = sourceDF.select(allFields)
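Calling df.show() on the result should then give something like this (the upper-cased column keeps its generated name because no alias is applied):
#+---------+---+
#|upper(n1)|  n|
#+---------+---+
#|        A|  1|
#+---------+---+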
You can generate an expression using list comprehension:
from pyspark.sql import functions as psf
select_expression = [psf.lower(psf.col(x)).alias(x) for x in df.columns]
And then just call it over your existing dataframe
>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
>>> df.select(*select_expression).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+

PySpark: withColumn() with two conditions and three outcomes

I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode:
df = df.withColumn('new_column',
IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.)
I am trying to do this in PySpark but I'm not sure about the syntax. Any pointers? I looked into expr() but couldn't get it to work.
Note that df is a pyspark.sql.dataframe.DataFrame.
There are a few efficient ways to implement this. Let's start with required imports:
from pyspark.sql.functions import col, expr, when
You can use Hive IF function inside expr:
new_column_1 = expr(
    """IF(fruit1 IS NULL OR fruit2 IS NULL, 3, IF(fruit1 = fruit2, 1, 0))"""
)
or when + otherwise:
new_column_2 = when(
    col("fruit1").isNull() | col("fruit2").isNull(), 3
).when(col("fruit1") == col("fruit2"), 1).otherwise(0)
Finally you could use the following trick: fruit1 == fruit2 evaluates to NULL when either column is NULL, the cast to int preserves that NULL, and coalesce then substitutes 3:
from pyspark.sql.functions import coalesce, lit
new_column_3 = coalesce((col("fruit1") == col("fruit2")).cast("int"), lit(3))
With example data:
df = sc.parallelize([
    ("orange", "apple"), ("kiwi", None), (None, "banana"),
    ("mango", "mango"), (None, None)
]).toDF(["fruit1", "fruit2"])
you can use this as follows:
(df
    .withColumn("new_column_1", new_column_1)
    .withColumn("new_column_2", new_column_2)
    .withColumn("new_column_3", new_column_3))
and the result is:
+------+------+------------+------------+------------+
|fruit1|fruit2|new_column_1|new_column_2|new_column_3|
+------+------+------------+------------+------------+
|orange| apple| 0| 0| 0|
| kiwi| null| 3| 3| 3|
| null|banana| 3| 3| 3|
| mango| mango| 1| 1| 1|
| null| null| 3| 3| 3|
+------+------+------------+------------+------------+
You'll want to use a udf as below
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
def func(fruit1, fruit2):
    if fruit1 is None or fruit2 is None:
        return 3
    if fruit1 == fruit2:
        return 1
    return 0

func_udf = udf(func, IntegerType())
df = df.withColumn('new_column', func_udf(df['fruit1'], df['fruit2']))
The withColumn function in pyspark enables you to make a new column with conditions; add in the when and otherwise functions and you have a properly working if-then-else structure.
For all of this you need to import the Spark SQL functions, as the following bit of code will not work without the col() function.
In the first bit, we declare a new column 'new_column' and give the condition enclosed in the when function (i.e. fruit1 == fruit2), returning 1 if the condition is true. If it is untrue, control goes to the otherwise, which handles the second condition (fruit1 or fruit2 is null) with the isNull() function: if that is true, 3 is returned, and if false, the inner otherwise returns 0.
from pyspark.sql import functions as F

df = df.withColumn('new_column',
    F.when(F.col('fruit1') == F.col('fruit2'), 1)
     .otherwise(F.when((F.col('fruit1').isNull()) | (F.col('fruit2').isNull()), 3)
                 .otherwise(0)))

Aggregate First Grouped Item from Subsequent Items

I have user game sessions containing: user id, game id, score and a timestamp when the game was played.
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
    ("u1", "g1", 10, 0),
    ("u1", "g3", 2, 2),
    ("u1", "g3", 5, 3),
    ("u1", "g4", 5, 4),
    ("u2", "g2", 1, 1),
], ["UserID", "GameID", "Score", "Time"])
Desired Output
+------+-------------+-------------+
|UserID|MaxScoreGame1|MaxScoreGame2|
+------+-------------+-------------+
| u1| 10| 5|
| u2| 1| null|
+------+-------------+-------------+
I want to transform the data such that I get the max score of the first game the user played as well as the max score of the second game (bonus if I can also get the max score of all subsequent games). Unfortunately I'm not sure how that's possible to do with Spark SQL.
I know I can group by UserID, GameID and then agg to get the max score and min time. Not sure how to proceed from there.
Clarification: note that MaxScoreGame1 and MaxScoreGame2 refer to the first and second game the user played, not the GameID.
You could try using a combination of Window functions and Pivot.
Get the row number for every game partitioned by UserID ordered by Time.
Filter down to GameNumber being 1 or 2.
Pivot on that to get your desired output shape.
Unfortunately I am using Scala, not Python, but the below should be fairly easy to transfer to the Python API.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Use a window function to get row number
val rowNumberWindow = Window.partitionBy(col("UserId")).orderBy(col("Time"))

val output = {
  df
    .select(
      col("*"),
      row_number().over(rowNumberWindow).alias("GameNumber")
    )
    .filter(col("GameNumber") <= lit(2))
    .groupBy(col("UserId"))
    .pivot("GameNumber")
    .agg(
      sum(col("Score"))
    )
}
output.show()
+------+---+----+
|UserId| 1| 2|
+------+---+----+
| u1| 10| 2|
| u2| 1|null|
+------+---+----+
Solution with PySpark (unlike the Scala version above, this first takes the max score per game, which is why u1's second column shows 5 rather than 2):
from pyspark.sql import Window

rowNumberWindow = Window.partitionBy("UserID").orderBy(F.col("Time"))

(df
 .groupBy("UserID", "GameID")
 .agg(F.max("Score").alias("Score"),
      F.min("Time").alias("Time"))
 .select(F.col("*"),
         F.row_number().over(rowNumberWindow).alias("GameNumber"))
 .filter(F.col("GameNumber") <= F.lit(2))
 .withColumn("GameMaxScoreCol", F.concat(F.lit("MaxScoreGame"), F.col("GameNumber")))
 .groupBy("UserID")
 .pivot("GameMaxScoreCol")
 .agg(F.max("Score"))
).show()
+------+-------------+-------------+
|UserID|MaxScoreGame1|MaxScoreGame2|
+------+-------------+-------------+
| u1| 10| 5|
| u2| 1| null|
+------+-------------+-------------+
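For the bonus part (the max score of every game the user played, not just the first two), dropping the GameNumber filter should be enough, since the pivot then produces one MaxScoreGameN column per game (untested sketch based on the code above):
(df
 .groupBy("UserID", "GameID")
 .agg(F.max("Score").alias("Score"),
      F.min("Time").alias("Time"))
 .select(F.col("*"),
         F.row_number().over(rowNumberWindow).alias("GameNumber"))
 .withColumn("GameMaxScoreCol", F.concat(F.lit("MaxScoreGame"), F.col("GameNumber")))
 .groupBy("UserID")
 .pivot("GameMaxScoreCol")
 .agg(F.max("Score"))
).show()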
