Sampling N rows for every key/value in a column using Pyspark [duplicate] - apache-spark

I'm new to using Spark in Python and have been unable to solve this problem: After running groupBy on a pyspark.sql.dataframe.DataFrame
df = sqlsc.read.json("data.json")
df.groupBy('teamId')
how can you choose N random samples from each resulting group (grouped by teamId) without replacement?
I'm basically trying to choose N random users from each team, maybe using groupBy is wrong to start with?

Well, it is kind of wrong. GroupedData is not really designed for a data access. It just describes grouping criteria and provides aggregation methods. See my answer to Using groupBy in Spark and getting back to a DataFrame for more details.
Another problem with this idea is selecting N random samples. It is a task which is really hard to achieve in parallel without psychical grouping of data and it is not something that happens when you call groupBy on a DataFrame:
There are at least two ways to handle this:
convert to RDD, groupBy and perform local sampling
import random
n = 3
def sample(iter, n):
rs = random.Random() # We should probably use os.urandom as a seed
return rs.sample(list(iter), n)
df = sqlContext.createDataFrame(
[(x, y, random.random()) for x in (1, 2, 3) for y in "abcdefghi"],
("teamId", "x1", "x2"))
grouped = df.rdd.map(lambda row: (row.teamId, row)).groupByKey()
sampled = sqlContext.createDataFrame(
grouped.flatMap(lambda kv: sample(kv[1], n)))
sampled.show()
## +------+---+-------------------+
## |teamId| x1| x2|
## +------+---+-------------------+
## | 1| g| 0.81921738561455|
## | 1| f| 0.8563875814036598|
## | 1| a| 0.9010425238735935|
## | 2| c| 0.3864428179837973|
## | 2| g|0.06233470405822805|
## | 2| d|0.37620872770129155|
## | 3| f| 0.7518901502732027|
## | 3| e| 0.5142305439671874|
## | 3| d| 0.6250620479303716|
## +------+---+-------------------+
use window functions
from pyspark.sql import Window
from pyspark.sql.functions import col, rand, rowNumber
w = Window.partitionBy(col("teamId")).orderBy(col("rnd_"))
sampled = (df
.withColumn("rnd_", rand()) # Add random numbers column
.withColumn("rn_", rowNumber().over(w)) # Add rowNumber over windw
.where(col("rn_") <= n) # Take n observations
.drop("rn_") # drop helper columns
.drop("rnd_"))
sampled.show()
## +------+---+--------------------+
## |teamId| x1| x2|
## +------+---+--------------------+
## | 1| f| 0.8563875814036598|
## | 1| g| 0.81921738561455|
## | 1| i| 0.8173912535268248|
## | 2| h| 0.10862995810038856|
## | 2| c| 0.3864428179837973|
## | 2| a| 0.6695356657072442|
## | 3| b|0.012329360826023095|
## | 3| a| 0.6450777858109182|
## | 3| e| 0.5142305439671874|
## +------+---+--------------------+
but I am afraid both will be rather expensive. If size of the individual groups is balanced and relatively large I would simply use DataFrame.randomSplit.
If number of groups is relatively small it is possible to try something else:
from pyspark.sql.functions import count, udf
from pyspark.sql.types import BooleanType
from operator import truediv
counts = (df
.groupBy(col("teamId"))
.agg(count("*").alias("n"))
.rdd.map(lambda r: (r.teamId, r.n))
.collectAsMap())
# This defines fraction of observations from a group which should
# be taken to get n values
counts_bd = sc.broadcast({k: truediv(n, v) for (k, v) in counts.items()})
to_take = udf(lambda k, rnd: rnd <= counts_bd.value.get(k), BooleanType())
sampled = (df
.withColumn("rnd_", rand())
.where(to_take(col("teamId"), col("rnd_")))
.drop("rnd_"))
sampled.show()
## +------+---+--------------------+
## |teamId| x1| x2|
## +------+---+--------------------+
## | 1| d| 0.14815204548854788|
## | 1| f| 0.8563875814036598|
## | 1| g| 0.81921738561455|
## | 2| a| 0.6695356657072442|
## | 2| d| 0.37620872770129155|
## | 2| g| 0.06233470405822805|
## | 3| b|0.012329360826023095|
## | 3| h| 0.9022527556458557|
## +------+---+--------------------+
In Spark 1.5+ you can replace udf with a call to sampleBy method:
df.sampleBy("teamId", counts_bd.value)
It won't give you exact number of observations but should be good enough most of the time as long as a number of observations per group is large enough to get proper samples. You can also use sampleByKey on a RDD in a similar way.

I found this one more dataframey, rather than going into rdd way.
You can use window function to create ranking within a group, where ranking can be random to suit your case. Then, you can filter based on the number of samples (N) you want for each group
window_1 = Window.partitionBy(data['teamId']).orderBy(F.rand())
data_1 = data.select('*', F.rank().over(window_1).alias('rank')).filter(F.col('rank') <= N).drop('rank')

Here's an alternative using Pandas DataFrame.Sample method. This uses the spark applyInPandas method to distribute the groups, available from Spark 3.0.0. This allows you to select an exact number of rows per group.
I've added args and kwargs to the function so you can access the other arguments of DataFrame.Sample.
def sample_n_per_group(n, *args, **kwargs):
def sample_per_group(pdf):
return pdf.sample(n, *args, **kwargs)
return sample_per_group
df = spark.createDataFrame(
[
(1, 1.0),
(1, 2.0),
(2, 3.0),
(2, 5.0),
(2, 10.0)
],
("id", "v")
)
(df.groupBy("id")
.applyInPandas(
sample_n_per_group(2, random_state=2),
schema=df.schema
)
)
To be aware of the limitations for very large groups, from the documentation:
This function requires a full shuffle. All the data of a group will be
loaded into memory, so the user should be aware of the potential OOM
risk if data is skewed and certain groups are too large to fit in
memory.
See also here:
How take a random row from a PySpark DataFrame?

Related

How to translate SQL UPDATE query which uses inner join into PySpark?

I have two MS Access SQL queries which I want to convert into PySpark. The queries look like this (we have two tables Employee and Department):
UPDATE EMPLOYEE INNER JOIN [DEPARTMENT] ON
EMPLOYEE.STATEPROVINCE = [DEPARTMENT].[STATE_LEVEL]
SET EMPLOYEE.STATEPROVINCE = [DEPARTMENT]![STATE_ABBREVIATION];
UPDATE EMPLOYEE INNER JOIN [DEPARTMENT] ON
EMPLOYEE.STATEPROVINCE = [DEPARTMENT].[STATE_LEVEL]
SET EMPLOYEE.MARKET = [DEPARTMENT]![MARKET];
Test dataframes:
from pyspark.sql import functions as F
df_emp = spark.createDataFrame([(1, 'a'), (2, 'bb')], ['EMPLOYEE', 'STATEPROVINCE'])
df_emp.show()
# +--------+-------------+
# |EMPLOYEE|STATEPROVINCE|
# +--------+-------------+
# | 1| a|
# | 2| bb|
# +--------+-------------+
df_dept = spark.createDataFrame([('bb', 'b')], ['STATE_LEVEL', 'STATE_ABBREVIATION'])
df_dept.show()
# +-----------+------------------+
# |STATE_LEVEL|STATE_ABBREVIATION|
# +-----------+------------------+
# | bb| b|
# +-----------+------------------+
Running your SQL query in Microsoft Access does the following:
In PySpark, you can get it like this:
df = (df_emp.alias('a')
.join(df_dept.alias('b'), df_emp.STATEPROVINCE == df_dept.STATE_LEVEL, 'left')
.select(
*[c for c in df_emp.columns if c != 'STATEPROVINCE'],
F.coalesce('b.STATE_ABBREVIATION', 'a.STATEPROVINCE').alias('STATEPROVINCE')
)
)
df.show()
# +--------+-------------+
# |EMPLOYEE|STATEPROVINCE|
# +--------+-------------+
# | 1| a|
# | 2| b|
# +--------+-------------+
First you do a left join. Then, select.
The select has 2 parts.
First, you select everything from df_emp except for "STATEPROVINCE".
Then, for the new "STATEPROVINCE", you select "STATE_ABBREVIATION" from df_dept, but in case it's null (i.e. not existent in df_dept), you take "STATEPROVINCE" from df_emp.
For your second query, you only need to change values in the select statement:
df = (df_emp.alias('a')
.join(df_dept.alias('b'), df_emp.STATEPROVINCE == df_dept.STATE_LEVEL, 'left')
.select(
*[c for c in df_emp.columns if c != 'MARKET'],
F.coalesce('b.MARKET', 'a.MARKET').alias('MARKET')
)
)

Create sequential unique id for each group

I'm trying to find an equivalent for the following snippet (reference) to create unique id to every unique combination from two columns in PySpark.
Pandas approach:
df['my_id'] = df.groupby(['foo', 'bar'], sort=False).ngroup() + 1
I tried the following, but it's creating more ids than required:
df = df.withColumn("my_id", F.row_number().over(Window.orderBy('foo', 'bar')))
Instead of row_number, use dense_rank:
from pyspark.sql import functions as F, Window
df = spark.createDataFrame(
[('r1', 'ph1'),
('r1', 'ph1'),
('r1', 'ph2'),
('s4', 'ph3'),
('s3', 'ph2'),
('s3', 'ph2')],
['foo', 'bar'])
df = df.withColumn("my_id", F.dense_rank().over(Window.orderBy('foo', 'bar')))
df.show()
# +---+---+-----+
# |foo|bar|my_id|
# +---+---+-----+
# | r1|ph1| 1|
# | r1|ph1| 1|
# | r1|ph2| 2|
# | s3|ph2| 3|
# | s3|ph2| 3|
# | s4|ph3| 4|
# +---+---+-----+

Merging two or more dataframes/rdd efficiently in PySpark

I'm trying to merge three RDD's based on the same key. The following is the data.
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 15|
| 3| Candy| 15|
| 1| Bahroze| 15|
+------+---------+-----+
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 7342|
| 3| Candy| 5669|
| 1| Bahroze| 8361|
+------+---------+-----+
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 37|
| 3| Candy| 27|
| 1| Bahroze| 39|
+------+---------+-----+
I'm able to merge these three DF. I converted them to RDD dict with the following code for all three
new_rdd = userTotalVisits.rdd.map(lambda row: row.asDict(True))
After RDD conversion, I'm taking one RDD and the other two as lists. Mapping the first RDD and then adding other keys to it based on the same UserID. I was hoping there was a better way of doing this using pyspark. Here's the code I've written.
def transform(row):
# Add a new key to each row
for x in conversion_list: # first rdd in list of object as[{}] after using collect()
if( x['UserID'] == row['UserID'] ):
row["Total"] = { "Visitors": row["Total"], "Conversions": x["Total"] }
for y in Revenue_list: # second rdd in list of object as[{}] after using collect()
if( y['UserID'] == row['UserID'] ):
row["Total"]["Revenue"] = y["Total"]
return row
potato = new_rdd.map(lambda row: transform(row)) #first rdd
How should I efficiently merge these three RDDs/DFs? (because I had to perform three different task on a huge DF). Looking for a better efficient idea. PS I'm still spark newbie. The result of my code does is as follows which is what I need.
{'UserID': '2', 'UserLabel': 'Panda', 'Total': {'Visitors': 37, 'Conversions': 15, 'Revenue': 7342}}
{'UserID': '3', 'UserLabel': 'Candy', 'Total': {'Visitors': 27, 'Conversions': 15, 'Revenue': 5669}}
{'UserID': '1', 'UserLabel': 'Bahroze', 'Total': {'Visitors': 39, 'Conversions': 15, 'Revenue': 8361}}
Thank you.
You can join the 3 dataframes on columns ["UserID", "UserLabel"], create a new struct total from the 3 total columns:
from pyspark.sql import functions as F
result = df1.alias("conv") \
.join(df2.alias("rev"), ["UserID", "UserLabel"], "left") \
.join(df3.alias("visit"), ["UserID", "UserLabel"], "left") \
.select(
F.col("UserID"),
F.col("UserLabel"),
F.struct(
F.col("conv.Total").alias("Conversions"),
F.col("rev.Total").alias("Revenue"),
F.col("visit.Total").alias("Visitors")
).alias("Total")
)
# write into json file
result.write.json("output")
# print result:
for i in result.toJSON().collect():
print(i)
# {"UserID":3,"UserLabel":"Candy","Total":{"Conversions":15,"Revenue":5669,"Visitors":27}}
# {"UserID":1,"UserLabel":"Bahroze","Total":{"Conversions":15,"Revenue":8361,"Visitors":39}}
# {"UserID":2,"UserLabel":"Panda","Total":{"Conversions":15,"Revenue":7342,"Visitors":37}}
You can just do the left joins on all the three dataframes but make sure the first dataframe that you use has all the UserID and UserLabel Values. You can ignore the GroupBy operation as suggested by #blackbishop and still it would give you the required output
I am showing how it can be done in scala but you could do something similar in python.
//source data
val visitorDF = Seq((2,"Panda",15),(3,"Candy",15),(1,"Bahroze",15),(4,"Test",25)).toDF("UserID","UserLabel","Total")
val conversionsDF = Seq((2,"Panda",37),(3,"Candy",27),(1,"Bahroze",39)).toDF("UserID","UserLabel","Total")
val revenueDF = Seq((2,"Panda",7342),(3,"Candy",5669),(1,"Bahroze",8361)).toDF("UserID","UserLabel","Total")
import org.apache.spark.sql.functions._
val finalDF = visitorDF.as("v").join(conversionsDF.as("c"),Seq("UserID","UserLabel"),"left")
.join(revenueDF.as("r"),Seq("UserID","UserLabel"),"left")
.withColumn("TotalArray",struct($"v.Total".as("Visitor"),$"c.Total".as("Conversions"),$"r.Total".as("Revenue")))
.drop("Total")
display(finalDF)
You can see the output as below :

PySpark: withColumn() with two conditions and three outcomes

I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode:
df = df.withColumn('new_column',
IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.)
I am trying to do this in PySpark but I'm not sure about the syntax. Any pointers? I looked into expr() but couldn't get it to work.
Note that df is a pyspark.sql.dataframe.DataFrame.
There are a few efficient ways to implement this. Let's start with required imports:
from pyspark.sql.functions import col, expr, when
You can use Hive IF function inside expr:
new_column_1 = expr(
"""IF(fruit1 IS NULL OR fruit2 IS NULL, 3, IF(fruit1 = fruit2, 1, 0))"""
)
or when + otherwise:
new_column_2 = when(
col("fruit1").isNull() | col("fruit2").isNull(), 3
).when(col("fruit1") == col("fruit2"), 1).otherwise(0)
Finally you could use following trick:
from pyspark.sql.functions import coalesce, lit
new_column_3 = coalesce((col("fruit1") == col("fruit2")).cast("int"), lit(3))
With example data:
df = sc.parallelize([
("orange", "apple"), ("kiwi", None), (None, "banana"),
("mango", "mango"), (None, None)
]).toDF(["fruit1", "fruit2"])
you can use this as follows:
(df
.withColumn("new_column_1", new_column_1)
.withColumn("new_column_2", new_column_2)
.withColumn("new_column_3", new_column_3))
and the result is:
+------+------+------------+------------+------------+
|fruit1|fruit2|new_column_1|new_column_2|new_column_3|
+------+------+------------+------------+------------+
|orange| apple| 0| 0| 0|
| kiwi| null| 3| 3| 3|
| null|banana| 3| 3| 3|
| mango| mango| 1| 1| 1|
| null| null| 3| 3| 3|
+------+------+------------+------------+------------+
You'll want to use a udf as below
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
def func(fruit1, fruit2):
if fruit1 == None or fruit2 == None:
return 3
if fruit1 == fruit2:
return 1
return 0
func_udf = udf(func, IntegerType())
df = df.withColumn('new_column',func_udf(df['fruit1'], df['fruit2']))
The withColumn function in pyspark enables you to make a new variable with conditions, add in the when and otherwise functions and you have a properly working if then else structure.
For all of this you would need to import the sparksql functions, as you will see that the following bit of code will not work without the col() function.
In the first bit, we declare a new column -'new column', and then give the condition enclosed in when function (i.e. fruit1==fruit2) then give 1 if the condition is true, if untrue the control goes to the otherwise which then takes care of the second condition (fruit1 or fruit2 is Null) with the isNull() function and if true 3 is returned and if false, the otherwise is checked again giving 0 as the answer.
from pyspark.sql import functions as F
df=df.withColumn('new_column',
F.when(F.col('fruit1')==F.col('fruit2'), 1)
.otherwise(F.when((F.col('fruit1').isNull()) | (F.col('fruit2').isNull()), 3))
.otherwise(0))

Aggregate First Grouped Item from Subsequent Items

I have user game sessions containing: user id, game id, score and a timestamp when the game was played.
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
("u1", "g1", 10, 0),
("u1", "g3", 2, 2),
("u1", "g3", 5, 3),
("u1", "g4", 5, 4),
("u2", "g2", 1, 1),
], ["UserID", "GameID", "Score", "Time"])
Desired Output
+------+-------------+-------------+
|UserID|MaxScoreGame1|MaxScoreGame2|
+------+-------------+-------------+
| u1| 10| 5|
| u2| 1| null|
+------+-------------+-------------+
I want to transform the data such that I get the max score of the first game the user played as well as the max score of the second game (bonus if I can also get the max score of all subsequent games). Unfortunately I'm not sure how that's possible to do with Spark SQL.
I know I can group by UserID, GameID and then agg to get the max score and min time. Not sure to how to proceed from there.
Clarification: note that MaxScoreGame1 and MaxScoreGame2 refer to the first and second game user player; not the GameID.
You could try using a combination of Window functions and Pivot.
Get the row number for every game partitioned by UserID ordered by Time.
Filter down to GameNumber being 1 or 2.
Pivot on that to get your desired output shape.
Unfortunately I am using scala not python, but the below should be fairly easily transferable to python library.
import org.apache.spark.sql.expressions.Window
// Use a window function to get row number
val rowNumberWindow = Window.partitionBy(col("UserId")).orderBy(col("Time"))
val output = {
df
.select(
col("*"),
row_number().over(rowNumberWindow).alias("GameNumber")
)
.filter(col("GameNumber") <= lit(2))
.groupBy(col("UserId"))
.pivot("GameNumber")
.agg(
sum(col("Score"))
)
}
output.show()
+------+---+----+
|UserId| 1| 2|
+------+---+----+
| u1| 10| 2|
| u2| 1|null|
+------+---+----+
Solution with PySpark:
from pyspark.sql import Window
rowNumberWindow = Window.partitionBy("UserID").orderBy(F.col("Time"))
(df
.groupBy("UserID", "GameID")
.agg(F.max("Score").alias("Score"),
F.min("Time").alias("Time"))
.select(F.col("*"),
F.row_number().over(rowNumberWindow).alias("GameNumber"))
.filter(F.col("GameNumber") <= F.lit(2))
.withColumn("GameMaxScoreCol", F.concat(F.lit("MaxScoreGame"), F.col("GameNumber")))
.groupBy("UserID")
.pivot("GameMaxScoreCol")
.agg(F.max("Score"))
).show()
+------+-------------+-------------+
|UserID|MaxScoreGame1|MaxScoreGame2|
+------+-------------+-------------+
| u1| 10| 5|
| u2| 1| null|
+------+-------------+-------------+

Resources