SPARK 3 - Populate value with value from previous rows (lookup) - apache-spark

I am new to Spark. I have two dataframes, events and players.
The events dataframe consists of the columns:
event_id| player_id| match_id| impact_score
The players dataframe consists of the columns:
player_id| player_name| nationality
I am merging the two datasets on player_id with this query:
df_final = (events
    .orderBy("player_id")
    .join(players.orderBy("player_id"))
    .withColumn("current_team", when([no idea what goes in here]).otherwise(getCurrentTeam(col("player_id"))))
    .write.mode("overwrite")
    .partitionBy("current_team")
)
The getCurrentTeam function triggers an HTTP call that returns the player's current team.
I have data for over 30 million soccer plays and 97 players. I need help creating the current_team column. Imagine a certain player appearing 130,000 times in the events dataframe. I need to look up values from previous rows: if the player has already appeared, I just grab that value (like an in-memory catalog); if not, I call the web service.

Due to its distributed nature, Spark can't enforce "if the value was populated by a previous call, reuse it; otherwise call the web service" across rows. There are two possible options.
Since you are applying an inner join and the players df has the list of all distinct players, you can add the current_team column to that df before joining. If the players df is cached before the join, the UDF will most likely be invoked only once per player. See the discussion here for why a UDF can be called multiple times per record.
You can memoize getCurrentTeam.
Working Example - Prepopulate current_team
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
events_data = [(1, 1, 1, 10), (1, 2, 1, 20, ), (1, 3, 1, 30, ), (2, 3, 1, 30, ), (2, 1, 1, 10), (2, 2, 1, 20, ), ]
players_data = [(1, "Player1", "Nat", ), (2, "Player2", "Nat", ), (3, "Player3", "Nat", ), ]
events = spark.createDataFrame(events_data, ("event_id", "player_id", "match_id", "impact_score", ), ).repartition(3)
players = spark.createDataFrame(players_data, ("player_id", "player_name", "nationality", ), ).repartition(3)
@udf(StringType())
def getCurrentTeam(player_id):
    # mimic the HTTP call that returns the player's current team
    return f"player_{player_id}_team"

players_with_current_team = players.withColumn("current_team", getCurrentTeam(F.col("player_id"))).cache()
events.join(players_with_current_team, ["player_id"]).show()
Output
+---------+--------+--------+------------+-----------+-----------+-------------+
|player_id|event_id|match_id|impact_score|player_name|nationality| current_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
| 2| 2| 1| 20| Player2| Nat|player_2_team|
| 2| 1| 1| 20| Player2| Nat|player_2_team|
| 3| 2| 1| 30| Player3| Nat|player_3_team|
| 3| 1| 1| 30| Player3| Nat|player_3_team|
| 1| 2| 1| 10| Player1| Nat|player_1_team|
| 1| 1| 1| 10| Player1| Nat|player_1_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
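If getCurrentTeam is backed by a real web service, the UDF in the snippet above might look roughly like the sketch below. The endpoint URL, the response field, and the use of the requests package are assumptions, not part of the original question.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import requests  # assumed to be installed on the workers

@udf(StringType())
def getCurrentTeam(player_id):
    # hypothetical endpoint; replace with the real service URL
    resp = requests.get(f"https://example.com/players/{player_id}/current-team")
    resp.raise_for_status()
    return resp.json()["current_team"]  # assumed response field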
Working Example - Memoization
I have used a Python dict to mimic caching and an accumulator to count the number of mimicked network calls made.
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import time
events_data = [(1, 1, 1, 10), (1, 2, 1, 20, ), (1, 3, 1, 30, ), (2, 3, 1, 30, ), (2, 1, 1, 10), (2, 2, 1, 20, ), ]
players_data = [(1, "Player1", "Nat", ), (2, "Player2", "Nat", ), (3, "Player3", "Nat", ), ]
events = spark.createDataFrame(events_data, ("event_id", "player_id", "match_id", "impact_score", ), ).repartition(3)
players = spark.createDataFrame(players_data, ("player_id", "player_name", "nationality", ), ).repartition(3)
players_events_joined = events.join(players, ["player_id"])
memoized_call_counter = spark.sparkContext.accumulator(0)
def memoize_call():
    cache = {}
    def getCurrentTeam(player_id):
        global memoized_call_counter
        cached_value = cache.get(player_id, None)
        if cached_value is not None:
            return cached_value
        # sleep to mimic a network call
        time.sleep(1)
        # increment the counter every time the value can't be looked up in the cache
        memoized_call_counter.add(1)
        cache[player_id] = f"player_{player_id}_team"
        return cache[player_id]
    return getCurrentTeam
getCurrentTeam_udf = udf(memoize_call(), StringType())
players_events_joined.withColumn("current_team", getCurrentTeam_udf(F.col("player_id"))).show()
Output
+---------+--------+--------+------------+-----------+-----------+-------------+
|player_id|event_id|match_id|impact_score|player_name|nationality| current_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
| 2| 2| 1| 20| Player2| Nat|player_2_team|
| 2| 1| 1| 20| Player2| Nat|player_2_team|
| 3| 2| 1| 30| Player3| Nat|player_3_team|
| 3| 1| 1| 30| Player3| Nat|player_3_team|
| 1| 2| 1| 10| Player1| Nat|player_1_team|
| 1| 1| 1| 10| Player1| Nat|player_1_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
>>> memoized_call_counter.value
3
Since there are 3 unique players in total, the logic after time.sleep(1) ran only three times. The number of calls depends on the number of workers, since the cache is not shared across workers. Because I ran the example in local mode (with one worker), each player's team was fetched exactly once; with more workers, expect up to one call per player per worker.
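If the per-worker caches are a concern, another option (a sketch, not from the answer above) is to resolve the teams once on the driver, since there are only 97 distinct players, and join the result back in. This assumes getCurrentTeam is a plain Python function callable from the driver.
from pyspark.sql import functions as F

# collect the small set of distinct player ids to the driver
player_ids = [r.player_id for r in players.select("player_id").distinct().collect()]

# one HTTP call per player, made once on the driver
teams = [(pid, getCurrentTeam(pid)) for pid in player_ids]
teams_df = spark.createDataFrame(teams, ("player_id", "current_team"))

# join the resolved teams back to the events/players data
df_final = events.join(players, ["player_id"]).join(teams_df, ["player_id"])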

Related

How to randomize different numbers for subgroup of rows pyspark

I have a PySpark dataframe. I need to assign a random value taken from a list to every row matching a given condition. I did:
df = df.withColumn('rand_col', f.when(f.col('condition_col') == condition, random.choice(my_list)))
but the effect is that it picks only one random value and assigns it to all rows.
How can I randomize separately for each row?
You can:
use rand and floor from pyspark.sql.functions to create a random indexing column to index into my_list
create an array column containing the my_list values
index into that array column using the random index column
It would look something like this:
import pyspark.sql.functions as f
my_list = [1, 2, 30]
df = spark.createDataFrame(
    [
        (1, 0),
        (2, 1),
        (3, 1),
        (4, 0),
        (5, 1),
        (6, 1),
        (7, 0),
    ],
    ["id", "condition"]
)
df = df.withColumn('rand_index', f.when(f.col('condition') == 1, f.floor(f.rand() * len(my_list)))) \
    .withColumn('my_list', f.array([f.lit(x) for x in my_list])) \
    .withColumn('rand_value', f.when(f.col('condition') == 1, f.col("my_list")[f.col("rand_index")]))
df.show()
+---+---------+----------+----------+----------+
| id|condition|rand_index| my_list|rand_value|
+---+---------+----------+----------+----------+
| 1| 0| null|[1, 2, 30]| null|
| 2| 1| 0|[1, 2, 30]| 1|
| 3| 1| 2|[1, 2, 30]| 30|
| 4| 0| null|[1, 2, 30]| null|
| 5| 1| 1|[1, 2, 30]| 2|
| 6| 1| 2|[1, 2, 30]| 30|
| 7| 0| null|[1, 2, 30]| null|
+---+---------+----------+----------+----------+
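On Spark 2.4+ there is also a one-step alternative (a sketch, not from the answer above): shuffle the literal array per row and take its first element. f.shuffle is non-deterministic, so each row gets its own random pick.
import pyspark.sql.functions as f

# assumes Spark 2.4+ for f.shuffle; my_list and df are the same as above
df = df.withColumn(
    'rand_value',
    f.when(f.col('condition') == 1,
           f.shuffle(f.array([f.lit(x) for x in my_list]))[0])
)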

PySpark - create multiple aggregative map columns without using UDF or join

I have a huge dataframe that looks similar to this:
+----+-------+-------+-----+
|name|level_A|level_B|hours|
+----+-------+-------+-----+
| Bob| 10| 3| 5|
| Bob| 10| 3| 15|
| Bob| 20| 3| 25|
| Sue| 30| 3| 35|
| Sue| 30| 7| 45|
+----+-------+-------+-----+
My desired output:
+----+--------------------+------------------+
|name| map_level_A| map_level_B|
+----+--------------------+------------------+
| Bob|{10 -> 20, 20 -> 25}| {3 -> 45}|
| Sue| {30 -> 80}|{7 -> 45, 3 -> 35}|
+----+--------------------+------------------+
Meaning, group by name, adding 2 MapType columns that map level_A and level_B to the sum of hours.
I know I can get that output using a UDF or a join operation.
However, in practice the data is very big, and there are not 2 map columns but tens of them, so a join/UDF is just too costly.
Is there a more efficient way to do that?
You could consider using window functions. You'll need a window spec for each level_X, partitioned by both name and level_X, to calculate the sum of hours. Then group by name and build each map from the collected set of structs:
from pyspark.sql import Window
import pyspark.sql.functions as F
df = spark.createDataFrame([("Bob", 10, 3, 5), ("Bob", 10, 3, 15), ("Bob", 20, 3, 25),
("Sue", 30, 3, 35),("Sue", 30, 7, 45), ],
["name", "level_A", "level_B", "hours"])
wla = Window.partitionBy("name", "level_A")
wlb = Window.partitionBy("name", "level_B")
result = df.withColumn("hours_A", F.sum("hours").over(wla)) \
.withColumn("hours_B", F.sum("hours").over(wlb)) \
.groupBy("name") \
.agg(
F.map_from_entries(
F.collect_set(F.struct(F.col("level_A"), F.col("hours_A")))
).alias("map_level_A"),
F.map_from_entries(
F.collect_set(F.struct(F.col("level_B"), F.col("hours_B")))
).alias("map_level_B")
)
result.show()
#+----+--------------------+------------------+
#|name| map_level_A| map_level_B|
#+----+--------------------+------------------+
#| Sue| {30 -> 80}|{3 -> 35, 7 -> 45}|
#| Bob|{10 -> 20, 20 -> 25}| {3 -> 45}|
#+----+--------------------+------------------+
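Since the real data has tens of level columns rather than two, the same pattern can be generated in a loop. This is only a sketch; level_cols should list the actual level_* column names.
from pyspark.sql import Window
import pyspark.sql.functions as F

level_cols = ["level_A", "level_B"]  # extend with the real column names

tmp = df
for c in level_cols:
    # per-(name, level) sum of hours, same as the explicit windows above
    tmp = tmp.withColumn(f"hours_{c}", F.sum("hours").over(Window.partitionBy("name", c)))

result = tmp.groupBy("name").agg(*[
    F.map_from_entries(
        F.collect_set(F.struct(F.col(c), F.col(f"hours_{c}")))
    ).alias(f"map_{c}")
    for c in level_cols
])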

Pyspark: reshape data without aggregation

I want to reshape my data from 4x3 to 2x2 in pyspark without aggregating. My current output is the following:
columns = ['FAULTY', 'value_HIGH', 'count']
vals = [
(1, 0, 141),
(0, 0, 140),
(1, 1, 21),
(0, 1, 12)
]
What I want is a contingency table with the second column as two new binary columns (value_HIGH_1, value_HIGH_0) and the values from the count column - meaning:
columns = ['FAULTY', 'value_HIGH_1', 'value_HIGH_0']
vals = [
(1, 21, 141),
(0, 12, 140)
]
You can use pivot with a fake maximum aggregation (since you have only one element for each group):
import pyspark.sql.functions as F
df.groupBy('FAULTY').pivot('value_HIGH').agg(F.max('count')).selectExpr(
    'FAULTY', '`1` as value_high_1', '`0` as value_high_0'
).show()
+------+------------+------------+
|FAULTY|value_high_1|value_high_0|
+------+------------+------------+
| 0| 12| 140|
| 1| 21| 141|
+------+------------+------------+
Using groupby and pivot is the natural way to do this, but if you want to avoid any aggregation you can achieve it with a filter and a join:
import pyspark.sql.functions as f
df.where("value_HIGH = 1").select("FAULTY", f.col("count").alias("value_HIGH_1"))\
.join(
df.where("value_HIGH = 0").select("FAULTY", f.col("count").alias("value_HIGH_1")),
on="FAULTY"
)\
.show()
#+------+------------+------------+
#|FAULTY|value_HIGH_1|value_HIGH_0|
#+------+------------+------------+
#| 0| 12| 140|
#| 1| 21| 141|
#+------+------------+------------+
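Another option (a sketch, not from the answers above) is conditional aggregation, which avoids both the pivot and the self-join:
import pyspark.sql.functions as F

# max() over the single matching row per group just passes the value through
df.groupBy("FAULTY").agg(
    F.max(F.when(F.col("value_HIGH") == 1, F.col("count"))).alias("value_HIGH_1"),
    F.max(F.when(F.col("value_HIGH") == 0, F.col("count"))).alias("value_HIGH_0"),
).show()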

Pyspark find friendship pairs from friendship lists

I currently have data describing single directional friendship such as below:
For the first line, it means 1 added 3, 4, and 8 as friends but doesn't know their responses; if 3 also added 1 as a friend, they become a pair.
ID friendsList
1 [3, 4, 8]
2 [8]
3 [1]
4 [1]
5 [6]
6 [7]
7 [1]
8 [1, 2, 4]
How can I use PySpark and PySpark SQL to generate the friendship pairs in which both users are bi-directional friends? Sample output (distinct or not doesn't matter):
(1, 4)
(1, 8)
(1, 3)
(2, 8)
(3, 1)
(4, 1)
(8, 1)
(8, 2)
Thanks!
This can be achieved with the explode function and a self join, as shown below.
from pyspark.sql.functions import explode
df = spark.createDataFrame(((1,[3, 4, 8]),(2,[8]),(3,[1]),(4,[1]),(5,[6]),(6,[7]),(7,[1]),(8,[1, 2, 4])),["c1",'c2'])
df.withColumn('c2',explode(df['c2'])).createOrReplaceTempView('table1')
>>> spark.sql("SELECT t0.c1,t0.c2 FROM table1 t0 INNER JOIN table1 t1 ON t0.c1 = t1.c2 AND t0.c2 = t1.c1").show()
+---+---+
| c1| c2|
+---+---+
| 1| 3|
| 8| 1|
| 1| 4|
| 2| 8|
| 4| 1|
| 8| 2|
| 3| 1|
| 1| 8|
+---+---+
Use the below if the DataFrame API is preferred over Spark SQL.
from pyspark.sql.functions import col, explode

df = df.withColumn('c2', explode(df['c2']))
df.alias('df1') \
    .join(df.alias('df2'), (col('df1.c1') == col('df2.c2')) & (col('df2.c1') == col('df1.c2'))) \
    .select(col('df1.c1'), col('df1.c2')) \
    .show()
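If you only want each mutual pair reported once, a small follow-up sketch (not part of the original answer) is to keep only the rows where the first id is smaller:
from pyspark.sql.functions import col

# df here is the exploded (c1, c2) dataframe from the snippet above
pairs = df.alias('df1') \
    .join(df.alias('df2'), (col('df1.c1') == col('df2.c2')) & (col('df1.c2') == col('df2.c1'))) \
    .where(col('df1.c1') < col('df1.c2')) \
    .select(col('df1.c1'), col('df1.c2'))
pairs.show()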

How to get the min of each row in PySpark DataFrame [duplicate]

I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n), and my task is to compute the row-wise max over those columns.
For example:
Input: PySpark DataFrame containing :
col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]
Output:
col_4 = max(col_1, col_2, col_3) = [3,2,5]
There is something similar in pandas as explained in this question.
Is there any way of doing this in PySpark, or should I convert my PySpark df to a Pandas df and then perform the operations?
You can reduce using SQL expressions over a list of columns:
from pyspark.sql.functions import max as max_, col, when
from functools import reduce

def row_max(*cols):
    return reduce(
        lambda x, y: when(x > y, x).otherwise(y),
        [col(c) if isinstance(c, str) else c for c in cols]
    )

df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
      .toDF(["a", "b", "c"]))

df.select(row_max("a", "b", "c").alias("max"))
Spark 1.5+ also provides least and greatest:
from pyspark.sql.functions import greatest
df.select(greatest("a", "b", "c"))
If you want to keep the name of the max you can use structs:
from pyspark.sql.functions import struct, lit
def row_max_with_name(*cols):
    cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
    return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))

maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))
And finally you can use the above to select the "top" column:
from pyspark.sql.functions import max
((_, c), ) = (maxs
    .groupBy(col("maxs")["col"].alias("col"))
    .count()
    .agg(max(struct(col("count"), col("col"))))
    .first())

df.select(c)
We can use greatest
Creating DataFrame
df = spark.createDataFrame(
    [[1, 2, 3], [2, 1, 2], [3, 4, 5]],
    ['col_1', 'col_2', 'col_3']
)
df.show()
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| 1| 2| 3|
| 2| 1| 2|
| 3| 4| 5|
+-----+-----+-----+
Solution
from pyspark.sql.functions import greatest
df2 = df.withColumn('max_by_rows', greatest('col_1', 'col_2', 'col_3'))
#Only if you need col
#from pyspark.sql.functions import col
#df2 = df.withColumn('max', greatest(col('col_1'), col('col_2'), col('col_3')))
df2.show()
+-----+-----+-----+-----------+
|col_1|col_2|col_3|max_by_rows|
+-----+-----+-----+-----------+
| 1| 2| 3| 3|
| 2| 1| 2| 2|
| 3| 4| 5| 5|
+-----+-----+-----+-----------+
You can also use the pyspark built-in least:
from pyspark.sql.functions import least, col
df = df.withColumn('min', least(col('c1'), col('c2'), col('c3')))
Another simple way of doing it. Let us say that the below df is your dataframe:
df = sc.parallelize([(10, 10, 1 ), (200, 2, 20), (3, 30, 300), (400, 40, 4)]).toDF(["c1", "c2", "c3"])
df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 10| 10| 1|
|200| 2| 20|
| 3| 30|300|
|400| 40| 4|
+---+---+---+
You can process the above df as below to get the desired results:
from pyspark.sql.functions import lit, min
df.select(lit('c1').alias('cn1'), min(df.c1).alias('c1'),
          lit('c2').alias('cn2'), min(df.c2).alias('c2'),
          lit('c3').alias('cn3'), min(df.c3).alias('c3')
          )\
  .rdd.flatMap(lambda r: [(r.cn1, r.c1), (r.cn2, r.c2), (r.cn3, r.c3)])\
  .toDF(['Column', 'Min']).show()
+------+---+
|Column|Min|
+------+---+
|    c1|  3|
|    c2|  2|
|    c3|  1|
+------+---+
Scala solution:
val df = sc.parallelize(Seq((10, 10, 1), (200, 2, 20), (3, 30, 300), (400, 40, 4))).toDF("c1", "c2", "c3")
df.rdd.map(row => List[String](row(0).toString, row(1).toString, row(2).toString)).map(x => (x(0), x(1), x(2), x.min)).toDF("c1", "c2", "c3", "min").show
+---+---+---+---+
| c1| c2| c3|min|
+---+---+---+---+
| 10| 10| 1| 1|
|200| 2| 20| 2|
| 3| 30|300| 3|
|400| 40| 4| 4|
+---+---+---+---+
