Aggregate First Grouped Item from Subsequent Items - apache-spark

I have user game sessions containing: user id, game id, score and a timestamp when the game was played.
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

sc = SparkContext("local")
sqlContext = HiveContext(sc)

df = sqlContext.createDataFrame([
    ("u1", "g1", 10, 0),
    ("u1", "g3", 2, 2),
    ("u1", "g3", 5, 3),
    ("u1", "g4", 5, 4),
    ("u2", "g2", 1, 1),
], ["UserID", "GameID", "Score", "Time"])
Desired Output
+------+-------------+-------------+
|UserID|MaxScoreGame1|MaxScoreGame2|
+------+-------------+-------------+
| u1| 10| 5|
| u2| 1| null|
+------+-------------+-------------+
I want to transform the data so that I get the max score of the first game the user played, as well as the max score of the second game (bonus points if I can also get the max score of all subsequent games). Unfortunately, I'm not sure how to do that with Spark SQL.
I know I can group by UserID and GameID and then agg to get the max score and min time, but I'm not sure how to proceed from there.
Clarification: note that MaxScoreGame1 and MaxScoreGame2 refer to the first and second game the user played, not to the GameID.
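The intermediate aggregation I mentioned would look roughly like this (a sketch; MaxScorePerGame and FirstTime are just illustrative aliases):
per_game = (df
    .groupBy("UserID", "GameID")
    .agg(F.max("Score").alias("MaxScorePerGame"),
         F.min("Time").alias("FirstTime")))
per_game.show()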

You could try using a combination of Window functions and Pivot.
Get the row number for every game partitioned by UserID ordered by Time.
Filter down to GameNumber being 1 or 2.
Pivot on that to get your desired output shape.
Unfortunately I am using Scala, not Python, but the code below should be fairly easy to translate to the Python API.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Use a window function to get the row number per user, ordered by time played
val rowNumberWindow = Window.partitionBy(col("UserId")).orderBy(col("Time"))

val output = {
  df
    .select(
      col("*"),
      row_number().over(rowNumberWindow).alias("GameNumber")
    )
    .filter(col("GameNumber") <= lit(2))
    .groupBy(col("UserId"))
    .pivot("GameNumber")
    .agg(
      sum(col("Score"))
    )
}
output.show()
+------+---+----+
|UserId| 1| 2|
+------+---+----+
| u1| 10| 2|
| u2| 1|null|
+------+---+----+

Solution with PySpark:
from pyspark.sql import Window

rowNumberWindow = Window.partitionBy("UserID").orderBy(F.col("Time"))

(df
 # max score and first-played time per (user, game)
 .groupBy("UserID", "GameID")
 .agg(F.max("Score").alias("Score"),
      F.min("Time").alias("Time"))
 # number the games per user by the time they were first played
 .select(F.col("*"),
         F.row_number().over(rowNumberWindow).alias("GameNumber"))
 .filter(F.col("GameNumber") <= F.lit(2))
 # build the pivot column labels, e.g. MaxScoreGame1, MaxScoreGame2
 .withColumn("GameMaxScoreCol", F.concat(F.lit("MaxScoreGame"), F.col("GameNumber")))
 .groupBy("UserID")
 .pivot("GameMaxScoreCol")
 .agg(F.max("Score"))
).show()
+------+-------------+-------------+
|UserID|MaxScoreGame1|MaxScoreGame2|
+------+-------------+-------------+
| u1| 10| 5|
| u2| 1| null|
+------+-------------+-------------+
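For the bonus part of the question (the max score over all games after the second), one hedged sketch is to bucket every GameNumber above 2 into a single label before pivoting; MaxScoreLaterGames is just an illustrative name, not from the original answer:
from pyspark.sql import Window

rowNumberWindow = Window.partitionBy("UserID").orderBy(F.col("Time"))

(df
 .groupBy("UserID", "GameID")
 .agg(F.max("Score").alias("Score"),
      F.min("Time").alias("Time"))
 .select("*", F.row_number().over(rowNumberWindow).alias("GameNumber"))
 # bucket game 1, game 2, and everything later into pivot labels
 .withColumn("GameBucket",
             F.when(F.col("GameNumber") == 1, "MaxScoreGame1")
              .when(F.col("GameNumber") == 2, "MaxScoreGame2")
              .otherwise("MaxScoreLaterGames"))
 .groupBy("UserID")
 .pivot("GameBucket")
 .agg(F.max("Score"))
).show()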

Related

Pyspark Window Function to count Txns in Last Hour by Row

I have a df like the one in the example below; it is sorted by descending timestamp. I want to calculate, for each row, the number of txns in the last hour. It would be at least 1 (the current row), but could of course be more.
I am using an approach I modified from How to aggregate over rolling time window with groups in Spark, but it is not working and I am at a loss.
I expect the first row to yield 1, the second 4, the third 3, the fourth 2 and the fifth 1.
I would also like to get the maximum index of those txns that are counted (I have not worked on that yet; if I get the first part working I think I can modify it to do that too).
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.sql.window import Window
schema = StructType([
    StructField("datetime_string", StringType(), True),
    StructField("group_by", StringType(), True),
    StructField("index", IntegerType(), True)
])

df = spark.createDataFrame([
    Row(datetime_string='2022-01-05 04:12:18', group_by='group1', index=1),
    Row(datetime_string='2022-01-05 02:43:19', group_by='group1', index=2),
    Row(datetime_string='2022-01-05 02:23:30', group_by='group1', index=3),
    Row(datetime_string='2022-01-05 02:20:38', group_by='group1', index=4),
    Row(datetime_string='2022-01-05 02:18:39', group_by='group1', index=5)
], schema)

# create the timestamp and a version in seconds for math later on
df = (df
      .withColumn('timestamp', F.to_timestamp(F.col('datetime_string')))
      .withColumn('ts_seconds', F.col('timestamp').cast('long'))
      )

w = (Window()
     .partitionBy(F.col("group_by"))
     .orderBy(F.desc('ts_seconds'))
     .rowsBetween(Window.currentRow, Window.unboundedFollowing)
     )

diff = F.max('ts_seconds').over(w) - F.col('ts_seconds')
indicator = (diff < 60*60).cast("Integer")
subgroup = F.sum(indicator).over(w)

df1 = (df
       .withColumn('txns_last_hour', subgroup)
       )
df1.show()
What I get is this:
+-------------------+--------+-----+-------------------+----------+--------------+
|    datetime_string|group_by|index|          timestamp|ts_seconds|txns_last_hour|
+-------------------+--------+-----+-------------------+----------+--------------+
|2022-01-05 04:12:18|  group1|    1|2022-01-05 04:12:18|1641377538|             5|
|2022-01-05 02:43:19|  group1|    2|2022-01-05 02:43:19|1641372199|             4|
|2022-01-05 02:23:30|  group1|    3|2022-01-05 02:23:30|1641371010|             3|
|2022-01-05 02:20:38|  group1|    4|2022-01-05 02:20:38|1641370838|             2|
|2022-01-05 02:18:39|  group1|    5|2022-01-05 02:18:39|1641370719|             1|
+-------------------+--------+-----+-------------------+----------+--------------+
If someone would be gracious enough to explain what I am doing wrong, I would appreciate it.
Use rangeBetween instead of rowsBetween for the Window frame boundaries. Also, your Window ordering isn't correct: from the result you expect, it seems you need to order ascending, not descending.
Try with following modifications:
w = (Window()
     .partitionBy(F.col("group_by"))
     .orderBy('ts_seconds')
     .rangeBetween(-60*60, Window.currentRow)
     )

df1 = (df
       .withColumn('txns_last_hour', F.count("*").over(w))
       .orderBy(F.desc("ts_seconds"))
       )
df1.show()
#+-------------------+--------+-----+-------------------+----------+--------------+
#| datetime_string|group_by|index| timestamp|ts_seconds|txns_last_hour|
#+-------------------+--------+-----+-------------------+----------+--------------+
#|2022-01-05 04:12:18| group1| 1|2022-01-05 04:12:18|1641352338| 1|
#|2022-01-05 02:43:19| group1| 2|2022-01-05 02:43:19|1641346999| 4|
#|2022-01-05 02:23:30| group1| 3|2022-01-05 02:23:30|1641345810| 3|
#|2022-01-05 02:20:38| group1| 4|2022-01-05 02:20:38|1641345638| 2|
#|2022-01-05 02:18:39| group1| 5|2022-01-05 02:18:39|1641345519| 1|
#+-------------------+--------+-----+-------------------+----------+--------------+
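The question also asks for the maximum index among the counted txns; since the frame already covers exactly those rows, a hedged sketch (max_index_last_hour is an illustrative name) is to add another aggregate over the same window w:
df1 = (df
       .withColumn('txns_last_hour', F.count("*").over(w))
       .withColumn('max_index_last_hour', F.max("index").over(w))
       .orderBy(F.desc("ts_seconds"))
       )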

Merging two or more dataframes/rdd efficiently in PySpark

I'm trying to merge three RDDs based on the same key. The following is the data:
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 15|
| 3| Candy| 15|
| 1| Bahroze| 15|
+------+---------+-----+
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 7342|
| 3| Candy| 5669|
| 1| Bahroze| 8361|
+------+---------+-----+
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 37|
| 3| Candy| 27|
| 1| Bahroze| 39|
+------+---------+-----+
I'm able to merge these three DFs. I converted them to RDDs of dicts with the following code (for all three):
new_rdd = userTotalVisits.rdd.map(lambda row: row.asDict(True))
After the RDD conversion, I take one RDD and collect the other two as lists. I then map over the first RDD and add the other keys to each row, matching on UserID. I was hoping there was a better way of doing this using PySpark. Here's the code I've written:
def transform(row):
    # Add a new key to each row
    for x in conversion_list:  # first rdd as a list of dicts [{}] after using collect()
        if x['UserID'] == row['UserID']:
            row["Total"] = {"Visitors": row["Total"], "Conversions": x["Total"]}
    for y in Revenue_list:  # second rdd as a list of dicts [{}] after using collect()
        if y['UserID'] == row['UserID']:
            row["Total"]["Revenue"] = y["Total"]
    return row

potato = new_rdd.map(lambda row: transform(row))  # first rdd
How should I efficiently merge these three RDDs/DFs? (I had to perform three different tasks on one huge DF.) I'm looking for a more efficient approach. PS: I'm still a Spark newbie. The result of my code is as follows, which is what I need:
{'UserID': '2', 'UserLabel': 'Panda', 'Total': {'Visitors': 37, 'Conversions': 15, 'Revenue': 7342}}
{'UserID': '3', 'UserLabel': 'Candy', 'Total': {'Visitors': 27, 'Conversions': 15, 'Revenue': 5669}}
{'UserID': '1', 'UserLabel': 'Bahroze', 'Total': {'Visitors': 39, 'Conversions': 15, 'Revenue': 8361}}
Thank you.
You can join the 3 dataframes on the columns ["UserID", "UserLabel"], then create a new struct Total from the 3 total columns:
from pyspark.sql import functions as F

result = df1.alias("conv") \
    .join(df2.alias("rev"), ["UserID", "UserLabel"], "left") \
    .join(df3.alias("visit"), ["UserID", "UserLabel"], "left") \
    .select(
        F.col("UserID"),
        F.col("UserLabel"),
        F.struct(
            F.col("conv.Total").alias("Conversions"),
            F.col("rev.Total").alias("Revenue"),
            F.col("visit.Total").alias("Visitors")
        ).alias("Total")
    )

# write into json file
result.write.json("output")

# print result:
for i in result.toJSON().collect():
    print(i)
# {"UserID":3,"UserLabel":"Candy","Total":{"Conversions":15,"Revenue":5669,"Visitors":27}}
# {"UserID":1,"UserLabel":"Bahroze","Total":{"Conversions":15,"Revenue":8361,"Visitors":39}}
# {"UserID":2,"UserLabel":"Panda","Total":{"Conversions":15,"Revenue":7342,"Visitors":37}}
You can just do left joins on all three dataframes, but make sure the first dataframe you use has all the UserID and UserLabel values. You can skip the groupBy operation as suggested by @blackbishop and it would still give you the required output.
I am showing how it can be done in Scala, but you could do something similar in Python.
// source data
val visitorDF = Seq((2,"Panda",15),(3,"Candy",15),(1,"Bahroze",15),(4,"Test",25)).toDF("UserID","UserLabel","Total")
val conversionsDF = Seq((2,"Panda",37),(3,"Candy",27),(1,"Bahroze",39)).toDF("UserID","UserLabel","Total")
val revenueDF = Seq((2,"Panda",7342),(3,"Candy",5669),(1,"Bahroze",8361)).toDF("UserID","UserLabel","Total")

import org.apache.spark.sql.functions._

val finalDF = visitorDF.as("v")
  .join(conversionsDF.as("c"), Seq("UserID","UserLabel"), "left")
  .join(revenueDF.as("r"), Seq("UserID","UserLabel"), "left")
  .withColumn("TotalArray", struct($"v.Total".as("Visitor"), $"c.Total".as("Conversions"), $"r.Total".as("Revenue")))
  .drop("Total")

display(finalDF)
You can see the output below:
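(The screenshot is not reproduced here; based on the sample Seqs above, the joined result should look roughly like this:)
+------+---------+----------------+
|UserID|UserLabel|      TotalArray|
+------+---------+----------------+
|     2|    Panda| {15, 37, 7342}|
|     3|    Candy| {15, 27, 5669}|
|     1|  Bahroze| {15, 39, 8361}|
|     4|     Test|{25, null, null}|
+------+---------+----------------+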

Pyspark: How to remove an item from a collect_set?

In the following dataframe:
from pyspark.sql import functions as F

df = sqlContext.createDataFrame([
    ("a", "code1", "name"),
    ("a", "code1", "name2"),
    ("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
You can run this command to get a list of distinct values:
df.groupby("id").agg(F.collect_set("code")).show()
+---+-----------------+
| id|collect_set(code)|
+---+-----------------+
| a| [code2, code1]|
+---+-----------------+
How do you remove an item in the above collect_set? E.g. how to remove 'code2'
Update for Spark 2.4+: You can achieve this with array_remove:
df_grouped = df.groupby("id")\
.agg(F.array_remove(F.collect_set("code"), "code2").alias("codes"))
Original answer for Spark 2.3 and below
AFAIK there is no way to dynamically iterate over an ArrayType(), so if your data is already in an array you have two options:
Option 1: Explode, Filter, Collect
Use pyspark.sql.functions.explode() to turn the elements of the array into separate rows. Then use pyspark.sql.DataFrame.where() to filter out the desired values. Finally do a groupBy() and collect_set() to gather the data back into one row.
df_grouped = df.groupby("id").agg(F.collect_set("code").alias("codes"))
df_grouped.select("*", F.explode("codes").alias("exploded"))\
.where(~F.col("exploded").isin(["code2"]))\
.groupBy("id")\
.agg(F.collect_set("exploded").alias("codes"))\
.show()
#+---+-------+
#| id| codes|
#+---+-------+
#| a|[code1]|
#+---+-------+
Option 2: Use a UDF
from pyspark.sql.types import ArrayType, StringType

def filter_code(array):
    bad_values = {"code2"}
    return [x for x in array if x not in bad_values]

filter_code_udf = F.udf(lambda x: filter_code(x), ArrayType(StringType()))

df_grouped = df.groupby("id").agg(F.collect_set("code").alias("codes"))
df_grouped.withColumn("codes_filtered", filter_code_udf("codes")).show()
#+---+--------------+--------------+
#| id| codes|codes_filtered|
#+---+--------------+--------------+
#| a|[code2, code1]| [code1]|
#+---+--------------+--------------+
Of course, if you are starting from your original dataframe (before the groupBy() and collect_set()) you can just filter the desired values first:
df.where(~F.col("code").isin(["code2"])).groupby("id").agg(F.collect_set("code")).show()
#+---+-----------------+
#| id|collect_set(code)|
#+---+-----------------+
#| a| [code1]|
#+---+-----------------+
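As a side note (not part of the original answers): on Spark 2.4+, if you need to drop several values at once without a UDF, array_except is a hedged alternative; values_to_remove is just an illustrative name:
values_to_remove = F.array(F.lit("code2"))  # add more F.lit(...) entries to drop more values
df.groupby("id") \
  .agg(F.array_except(F.collect_set("code"), values_to_remove).alias("codes")) \
  .show()
#+---+-------+
#| id|  codes|
#+---+-------+
#|  a|[code1]|
#+---+-------+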

PySpark: withColumn() with two conditions and three outcomes

I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode:
df = df.withColumn('new_column',
IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.)
I am trying to do this in PySpark but I'm not sure about the syntax. Any pointers? I looked into expr() but couldn't get it to work.
Note that df is a pyspark.sql.dataframe.DataFrame.
There are a few efficient ways to implement this. Let's start with required imports:
from pyspark.sql.functions import col, expr, when
You can use the Hive IF function inside expr:
new_column_1 = expr(
"""IF(fruit1 IS NULL OR fruit2 IS NULL, 3, IF(fruit1 = fruit2, 1, 0))"""
)
or when + otherwise:
new_column_2 = when(
col("fruit1").isNull() | col("fruit2").isNull(), 3
).when(col("fruit1") == col("fruit2"), 1).otherwise(0)
Finally, you could use the following trick:
from pyspark.sql.functions import coalesce, lit
new_column_3 = coalesce((col("fruit1") == col("fruit2")).cast("int"), lit(3))
With example data:
df = sc.parallelize([
("orange", "apple"), ("kiwi", None), (None, "banana"),
("mango", "mango"), (None, None)
]).toDF(["fruit1", "fruit2"])
you can use this as follows:
(df
.withColumn("new_column_1", new_column_1)
.withColumn("new_column_2", new_column_2)
.withColumn("new_column_3", new_column_3))
and the result is:
+------+------+------------+------------+------------+
|fruit1|fruit2|new_column_1|new_column_2|new_column_3|
+------+------+------------+------------+------------+
|orange| apple| 0| 0| 0|
| kiwi| null| 3| 3| 3|
| null|banana| 3| 3| 3|
| mango| mango| 1| 1| 1|
| null| null| 3| 3| 3|
+------+------+------------+------------+------------+
You'll want to use a udf as below
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

def func(fruit1, fruit2):
    if fruit1 is None or fruit2 is None:
        return 3
    if fruit1 == fruit2:
        return 1
    return 0

func_udf = udf(func, IntegerType())
df = df.withColumn('new_column', func_udf(df['fruit1'], df['fruit2']))
The withColumn function in PySpark lets you build a new column from conditions; combined with the when and otherwise functions, you get a properly working if/then/else structure.
For all of this you need to import the Spark SQL functions, since the following bit of code will not work without the col() function.
First we declare the new column 'new_column' and give the first condition inside when() (fruit1 == fruit2), returning 1 if it is true. If it is not, control passes to otherwise(), which handles the second condition (fruit1 or fruit2 is null) via isNull() and returns 3 when it is true; otherwise 0 is returned.
from pyspark.sql import functions as F

# note: the final otherwise(0) must be attached to the inner when(),
# since otherwise() can only follow a when() that has no ELSE yet
df = df.withColumn('new_column',
                   F.when(F.col('fruit1') == F.col('fruit2'), 1)
                    .otherwise(F.when((F.col('fruit1').isNull()) | (F.col('fruit2').isNull()), 3)
                                .otherwise(0)))

Sampling N rows for every key/value in a column using Pyspark [duplicate]

I'm new to using Spark in Python and have been unable to solve this problem: After running groupBy on a pyspark.sql.dataframe.DataFrame
df = sqlsc.read.json("data.json")
df.groupBy('teamId')
how can you choose N random samples from each resulting group (grouped by teamId) without replacement?
I'm basically trying to choose N random users from each team, maybe using groupBy is wrong to start with?
Well, it is kind of wrong. GroupedData is not really designed for data access. It just describes the grouping criteria and provides aggregation methods. See my answer to Using groupBy in Spark and getting back to a DataFrame for more details.
Another problem with this idea is selecting N random samples. It is a task that is really hard to achieve in parallel without physical grouping of the data, and that is not something that happens when you call groupBy on a DataFrame:
There are at least two ways to handle this:
convert to RDD, groupBy and perform local sampling
import random

n = 3

def sample(iter, n):
    rs = random.Random()  # We should probably use os.urandom as a seed
    return rs.sample(list(iter), n)

df = sqlContext.createDataFrame(
    [(x, y, random.random()) for x in (1, 2, 3) for y in "abcdefghi"],
    ("teamId", "x1", "x2"))

grouped = df.rdd.map(lambda row: (row.teamId, row)).groupByKey()

sampled = sqlContext.createDataFrame(
    grouped.flatMap(lambda kv: sample(kv[1], n)))
sampled.show()
## +------+---+-------------------+
## |teamId| x1| x2|
## +------+---+-------------------+
## | 1| g| 0.81921738561455|
## | 1| f| 0.8563875814036598|
## | 1| a| 0.9010425238735935|
## | 2| c| 0.3864428179837973|
## | 2| g|0.06233470405822805|
## | 2| d|0.37620872770129155|
## | 3| f| 0.7518901502732027|
## | 3| e| 0.5142305439671874|
## | 3| d| 0.6250620479303716|
## +------+---+-------------------+
use window functions
from pyspark.sql import Window
from pyspark.sql.functions import col, rand, row_number  # rowNumber in Spark < 1.6

w = Window.partitionBy(col("teamId")).orderBy(col("rnd_"))

sampled = (df
    .withColumn("rnd_", rand())                   # Add random numbers column
    .withColumn("rn_", row_number().over(w))      # Add row_number over window
    .where(col("rn_") <= n)                       # Take n observations
    .drop("rn_")                                  # drop helper columns
    .drop("rnd_"))
sampled.show()
## +------+---+--------------------+
## |teamId| x1| x2|
## +------+---+--------------------+
## | 1| f| 0.8563875814036598|
## | 1| g| 0.81921738561455|
## | 1| i| 0.8173912535268248|
## | 2| h| 0.10862995810038856|
## | 2| c| 0.3864428179837973|
## | 2| a| 0.6695356657072442|
## | 3| b|0.012329360826023095|
## | 3| a| 0.6450777858109182|
## | 3| e| 0.5142305439671874|
## +------+---+--------------------+
but I am afraid both will be rather expensive. If the size of the individual groups is balanced and relatively large, I would simply use DataFrame.randomSplit.
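A minimal sketch of the randomSplit idea (it samples by fraction over the whole DataFrame, so per-group counts are only approximate; variable names are illustrative):
# keep roughly 30% of all rows; only reasonable when group sizes are balanced
sample_df, rest_df = df.randomSplit([0.3, 0.7], seed=42)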
If the number of groups is relatively small, it is possible to try something else:
from pyspark.sql.functions import count, udf
from pyspark.sql.types import BooleanType
from operator import truediv

counts = (df
    .groupBy(col("teamId"))
    .agg(count("*").alias("n"))
    .rdd.map(lambda r: (r.teamId, r.n))
    .collectAsMap())

# This defines the fraction of observations from a group which should
# be taken to get n values
counts_bd = sc.broadcast({k: truediv(n, v) for (k, v) in counts.items()})

to_take = udf(lambda k, rnd: rnd <= counts_bd.value.get(k), BooleanType())

sampled = (df
    .withColumn("rnd_", rand())
    .where(to_take(col("teamId"), col("rnd_")))
    .drop("rnd_"))
sampled.show()
## +------+---+--------------------+
## |teamId| x1| x2|
## +------+---+--------------------+
## | 1| d| 0.14815204548854788|
## | 1| f| 0.8563875814036598|
## | 1| g| 0.81921738561455|
## | 2| a| 0.6695356657072442|
## | 2| d| 0.37620872770129155|
## | 2| g| 0.06233470405822805|
## | 3| b|0.012329360826023095|
## | 3| h| 0.9022527556458557|
## +------+---+--------------------+
In Spark 1.5+ you can replace the udf with a call to the sampleBy method:
df.sampleBy("teamId", counts_bd.value)
It won't give you an exact number of observations, but it should be good enough most of the time, as long as the number of observations per group is large enough to get proper samples. You can also use sampleByKey on an RDD in a similar way.
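A hedged sketch of the sampleByKey variant on the RDD (fractions per key, so the counts are approximate; sampled_rdd is an illustrative name):
fractions = counts_bd.value  # per-team fractions computed above
sampled_rdd = (df.rdd
               .keyBy(lambda row: row.teamId)
               .sampleByKey(False, fractions, seed=42)
               .values())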
I found this approach more DataFrame-friendly than going the RDD way.
You can use a window function to create a ranking within each group, where the ranking can be random to suit your case. Then you can filter based on the number of samples (N) you want for each group:
from pyspark.sql import Window
from pyspark.sql import functions as F

window_1 = Window.partitionBy(data['teamId']).orderBy(F.rand())

data_1 = (data.select('*', F.rank().over(window_1).alias('rank'))
              .filter(F.col('rank') <= N)
              .drop('rank'))
Here's an alternative using the pandas DataFrame.sample method. It uses the Spark applyInPandas method to distribute the groups, available from Spark 3.0.0. This allows you to select an exact number of rows per group.
I've added args and kwargs to the function so you can access the other arguments of DataFrame.sample.
def sample_n_per_group(n, *args, **kwargs):
    def sample_per_group(pdf):
        return pdf.sample(n, *args, **kwargs)
    return sample_per_group

df = spark.createDataFrame(
    [
        (1, 1.0),
        (1, 2.0),
        (2, 3.0),
        (2, 5.0),
        (2, 10.0)
    ],
    ("id", "v")
)

(df.groupBy("id")
   .applyInPandas(
       sample_n_per_group(2, random_state=2),
       schema=df.schema
   )
)
Be aware of the limitations for very large groups; from the documentation:
This function requires a full shuffle. All the data of a group will be loaded into memory, so the user should be aware of the potential OOM risk if data is skewed and certain groups are too large to fit in memory.
See also here:
How take a random row from a PySpark DataFrame?
