I have a huge dataframe that looks similar to this:
+----+-------+-------+-----+
|name|level_A|level_B|hours|
+----+-------+-------+-----+
| Bob|     10|      3|    5|
| Bob|     10|      3|   15|
| Bob|     20|      3|   25|
| Sue|     30|      3|   35|
| Sue|     30|      7|   45|
+----+-------+-------+-----+
My desired output:
+----+--------------------+------------------+
|name|         map_level_A|       map_level_B|
+----+--------------------+------------------+
| Bob|{10 -> 20, 20 -> 25}|         {3 -> 45}|
| Sue|          {30 -> 80}|{7 -> 45, 3 -> 35}|
+----+--------------------+------------------+
Meaning, group by name, adding 2 MapType columns that map level_A and level_B to the sum of hours.
I know I can get that output using a UDF or a join operation.
However, in practice, the data is very big, and it's not 2 map columns, but rather tens of them, so join/UDF are just too costly.
Is there a more efficient way to do that?
You could consider using Window functions. You'll need a WindowSpec for each level_X column, partitioned by both name and level_X, to calculate the sum of hours. Then group by name and build each map from an array of structs with map_from_entries:
from pyspark.sql import Window
import pyspark.sql.functions as F
df = spark.createDataFrame(
    [("Bob", 10, 3, 5), ("Bob", 10, 3, 15), ("Bob", 20, 3, 25),
     ("Sue", 30, 3, 35), ("Sue", 30, 7, 45)],
    ["name", "level_A", "level_B", "hours"]
)

wla = Window.partitionBy("name", "level_A")
wlb = Window.partitionBy("name", "level_B")

result = df.withColumn("hours_A", F.sum("hours").over(wla)) \
    .withColumn("hours_B", F.sum("hours").over(wlb)) \
    .groupBy("name") \
    .agg(
        F.map_from_entries(
            F.collect_set(F.struct(F.col("level_A"), F.col("hours_A")))
        ).alias("map_level_A"),
        F.map_from_entries(
            F.collect_set(F.struct(F.col("level_B"), F.col("hours_B")))
        ).alias("map_level_B")
    )

result.show()
#+----+--------------------+------------------+
#|name|         map_level_A|       map_level_B|
#+----+--------------------+------------------+
#| Sue|          {30 -> 80}|{3 -> 35, 7 -> 45}|
#| Bob|{10 -> 20, 20 -> 25}|         {3 -> 45}|
#+----+--------------------+------------------+
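Since the real data has tens of such level columns rather than two, the same pattern can be generated in a loop over a list of column names instead of writing each window and map aggregation by hand. A minimal sketch, where level_cols stands in for your actual list of level columns:

from pyspark.sql import Window
import pyspark.sql.functions as F

# hypothetical list of level columns; replace with your tens of columns
level_cols = ["level_A", "level_B"]

# add one windowed sum of hours per level column
tmp = df
for c in level_cols:
    w = Window.partitionBy("name", c)
    tmp = tmp.withColumn(f"hours_{c}", F.sum("hours").over(w))

# build one map column per level column in a single groupBy
result = tmp.groupBy("name").agg(*[
    F.map_from_entries(
        F.collect_set(F.struct(F.col(c), F.col(f"hours_{c}")))
    ).alias(f"map_{c}")
    for c in level_cols
])
result.show()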
Related
I am new to Spark. I have 2 dataframes, events and players.
events dataframe consists of columns
event_id| player_id| match_id| impact_score
players dataframe consists of columns
player_id| player_name| nationality
I am merging the two datasets by player_id with this query:
df_final = (events
            .orderBy("player_id")
            .join(players.orderBy("player_id"))
            .withColumn("current_team", when([no idea what goes in here]).otherwise(getCurrentTeam(col("player_id"))))
            .write.mode("overwrite")
            .partitionBy("current_team")
            )
The getCurrentTeam function triggers an HTTP call that returns a value (the player's current team).
I have data of over 30 million soccer plays and 97 players. I need help creating the current_team column. Imagine a certain player appearing 130,000 times in the events dataframe. I need to look up values from previous rows: if the player has already appeared, I just grab that value (like an in-memory catalog); if it has not, I call the webservice.
Due to its distributed nature, Spark can't simply express "if this value was populated in a previous call, reuse it, otherwise call the service and store the result". There are two possible options.
Since you are applying an inner join and the players df has the list of all distinct players, you can add the current_team column to this df before applying the join. If the players df is cached before joining, it's possible that the UDF is invoked only once for each player. See the discussion here for why a UDF can be called multiple times for each record.
You can memoize getCurrentTeam
Working Example - Prepopulate current_team
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
events_data = [(1, 1, 1, 10), (1, 2, 1, 20, ), (1, 3, 1, 30, ), (2, 3, 1, 30, ), (2, 1, 1, 10), (2, 2, 1, 20, ), ]
players_data = [(1, "Player1", "Nat", ), (2, "Player2", "Nat", ), (3, "Player3", "Nat", ), ]
events = spark.createDataFrame(events_data, ("event_id", "player_id", "match_id", "impact_score", ), ).repartition(3)
players = spark.createDataFrame(players_data, ("player_id", "player_name", "nationality", ), ).repartition(3)
@udf(StringType())
def getCurrentTeam(player_id):
    # stand-in for the real HTTP call
    return f"player_{player_id}_team"
players_with_current_team = players.withColumn("current_team", getCurrentTeam(F.col("player_id"))).cache()
events.join(players_with_current_team, ["player_id"]).show()
Output
+---------+--------+--------+------------+-----------+-----------+-------------+
|player_id|event_id|match_id|impact_score|player_name|nationality| current_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
| 2| 2| 1| 20| Player2| Nat|player_2_team|
| 2| 1| 1| 20| Player2| Nat|player_2_team|
| 3| 2| 1| 30| Player3| Nat|player_3_team|
| 3| 1| 1| 30| Player3| Nat|player_3_team|
| 1| 2| 1| 10| Player1| Nat|player_1_team|
| 1| 1| 1| 10| Player1| Nat|player_1_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
Working Example - Memoization
I have used a Python dict to mimic caching and an accumulator to count the number of mimicked network calls made.
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import time
events_data = [(1, 1, 1, 10), (1, 2, 1, 20, ), (1, 3, 1, 30, ), (2, 3, 1, 30, ), (2, 1, 1, 10), (2, 2, 1, 20, ), ]
players_data = [(1, "Player1", "Nat", ), (2, "Player2", "Nat", ), (3, "Player3", "Nat", ), ]
events = spark.createDataFrame(events_data, ("event_id", "player_id", "match_id", "impact_score", ), ).repartition(3)
players = spark.createDataFrame(players_data, ("player_id", "player_name", "nationality", ), ).repartition(3)
players_events_joined = events.join(players, ["player_id"])
memoized_call_counter = spark.sparkContext.accumulator(0)
def memoize_call():
    cache = {}
    def getCurrentTeam(player_id):
        global memoized_call_counter
        cached_value = cache.get(player_id, None)
        if cached_value is not None:
            return cached_value
        # sleep to mimic a network call
        time.sleep(1)
        # increment the counter every time a cached value can't be looked up
        memoized_call_counter.add(1)
        cache[player_id] = f"player_{player_id}_team"
        return cache[player_id]
    return getCurrentTeam
getCurrentTeam_udf = udf(memoize_call(), StringType())
players_events_joined.withColumn("current_team", getCurrentTeam_udf(F.col("player_id"))).show()
Output
+---------+--------+--------+------------+-----------+-----------+-------------+
|player_id|event_id|match_id|impact_score|player_name|nationality| current_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
| 2| 2| 1| 20| Player2| Nat|player_2_team|
| 2| 1| 1| 20| Player2| Nat|player_2_team|
| 3| 2| 1| 30| Player3| Nat|player_3_team|
| 3| 1| 1| 30| Player3| Nat|player_3_team|
| 1| 2| 1| 10| Player1| Nat|player_1_team|
| 1| 1| 1| 10| Player1| Nat|player_1_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
>>> memoized_call_counter.value
3
Since there are 3 unique players in total, the logic after time.sleep(1) was invoked only three times. The number of calls also depends on the number of workers, since the cache is not shared across them. As I ran the example in local mode (with 1 worker), the number of calls equals the number of unique players; with more workers, each builds its own cache, so the count can grow up to the number of unique players per worker.
I am a beginner of PySpark. Suppose I have a Spark dataframe like this:
test_df = spark.createDataFrame(pd.DataFrame({"a":[[1,2,3], [None,2,3], [None, None, None]]}))
Now I want to filter out rows whose array contains a None value (in my case, just keep the first row).
I have tried to use:
test_df.filter(array_contains(test_df.a, None))
But it does not work and throws an error:
AnalysisException: "cannot resolve 'array_contains(a, NULL)' due to
data type mismatch: Null typed values cannot be used as
arguments;;\n'Filter array_contains(a#166, null)\n+- LogicalRDD
[a#166], false\n
How should I filter in the correct way? Many thanks!
You need to use the forall function (a higher-order function available since Spark 3.0).
import pyspark.sql.functions as F

df = test_df.filter(F.expr('forall(a, x -> x is not null)'))
df.show(truncate=False)
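If you are on Spark 3.1 or later, the same predicate can also be written with the Python forall helper instead of an SQL expression (a sketch, assuming pyspark >= 3.1):

import pyspark.sql.functions as F

# F.forall is true only if the predicate holds for every element of the array
df = test_df.filter(F.forall("a", lambda x: x.isNotNull()))
df.show(truncate=False)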
You can use the aggregate higher-order function to count the number of nulls in each array and then filter rows where that count is 0. This lets you drop all rows with at least one None within the array.
from pyspark.sql import functions as func

data_ls = [
    (1, ["A", "B"]),
    (2, [None, "D"]),
    (3, [None, None])
]

data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['a', 'b'])
data_sdf.show()
+---+------+
|  a|     b|
+---+------+
|  1|[A, B]|
|  2| [, D]|
|  3|   [,]|
+---+------+
# count the number of nulls within an array
data_sdf. \
    withColumn('c', func.expr('aggregate(b, 0, (x, y) -> x + int(y is null))')). \
    show()

+---+------+---+
|  a|     b|  c|
+---+------+---+
|  1|[A, B]|  0|
|  2| [, D]|  1|
|  3|   [,]|  2|
+---+------+---+
Once you have the column created, you can apply the filter as filter(func.col('c') == 0), as shown below.
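For completeness, chaining the filter onto the previous step might look like this (a sketch reusing the same data_sdf):

from pyspark.sql import functions as func

# keep only rows whose array contains no nulls, then drop the helper column
data_sdf. \
    withColumn('c', func.expr('aggregate(b, 0, (x, y) -> x + int(y is null))')). \
    filter(func.col('c') == 0). \
    drop('c'). \
    show()
# leaves only the row (1, [A, B])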
You can use exists function:
test_df.filter("!exists(a, x -> x is null)").show()
#+---------+
#| a|
#+---------+
#|[1, 2, 3]|
#+---------+
Is it possible in pyspark to create a dictionary within groupBy.agg()? Here is a toy example:
import pyspark
from pyspark.sql import Row
import pyspark.sql.functions as F
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
toy_data = spark.createDataFrame([
Row(id=1, key='a', value="123"),
Row(id=1, key='b', value="234"),
Row(id=1, key='c', value="345"),
Row(id=2, key='a', value="12"),
Row(id=2, key='x', value="23"),
Row(id=2, key='y', value="123")])
toy_data.show()
+---+---+-----+
| id|key|value|
+---+---+-----+
|  1|  a|  123|
|  1|  b|  234|
|  1|  c|  345|
|  2|  a|   12|
|  2|  x|   23|
|  2|  y|  123|
+---+---+-----+
and this is the expected output:
---+------------------------------------
id | key_value
---+------------------------------------
1 | {"a": "123", "b": "234", "c": "345"}
2 | {"a": "12", "x": "23", "y": "123"}
---+------------------------------------
I tried this but it doesn't work.
toy_data.groupBy("id").agg(
F.create_map(col("key"),col("value")).alias("key_value")
)
This yields the following error:
AnalysisException: u"expression '`key`' is neither present in the group by, nor is it an aggregate function....
The agg component has to contain an actual aggregation function. One way to approach this is to combine collect_list ("Aggregate function: returns a list of objects with duplicates."), struct ("Creates a new struct column.") and map_from_entries ("Collection function: Returns a map created from the given array of entries.").
This is how you'd do that:
toy_data.groupBy("id").agg(
    F.map_from_entries(
        F.collect_list(
            F.struct("key", "value"))).alias("key_value")
).show(truncate=False)
+---+------------------------------+
|id |key_value |
+---+------------------------------+
|1 |[a -> 123, b -> 234, c -> 345]|
|2 |[a -> 12, x -> 23, y -> 123] |
+---+------------------------------+
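The expected output in the question shows JSON-style strings; if you actually need a JSON string column rather than a MapType, you can wrap the map with to_json (a sketch, assuming Spark 2.4+ where to_json accepts map columns):

import pyspark.sql.functions as F

toy_data.groupBy("id").agg(
    F.to_json(
        F.map_from_entries(
            F.collect_list(F.struct("key", "value"))
        )
    ).alias("key_value")
).show(truncate=False)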
For pyspark < 2.4.0, where pyspark.sql.functions.map_from_entries is not available, you can use your own UDF instead:
import pyspark.sql.functions as F
from pyspark.sql.types import MapType, StringType
@F.udf(returnType=MapType(StringType(), StringType()))
def map_array(column):
    return dict(column)
(toy_data.groupBy("id")
.agg(F.collect_list(F.struct("key", "value")).alias("key_value"))
.withColumn('key_value', map_array('key_value'))
.show(truncate=False))
+---+------------------------------+
|id |key_value |
+---+------------------------------+
|1 |[a -> 123, b -> 234, c -> 345]|
|2 |[x -> 23, a -> 12, y -> 123] |
+---+------------------------------+
I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n), and my task is to pick, for each row, the max value across those columns.
For example:
Input: PySpark DataFrame containing :
col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]
Output:
col_4 = max(col_1, col_2, col_3) = [3,2,5]
There is something similar in pandas as explained in this question.
Is there any way of doing this in PySpark, or should I convert my PySpark df to a Pandas df and then perform the operations?
You can reduce using SQL expressions over a list of columns:
from pyspark.sql.functions import max as max_, col, when
from functools import reduce
def row_max(*cols):
    return reduce(
        lambda x, y: when(x > y, x).otherwise(y),
        [col(c) if isinstance(c, str) else c for c in cols]
    )

df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
      .toDF(["a", "b", "c"]))

df.select(row_max("a", "b", "c").alias("max"))
Spark 1.5+ also provides least and greatest:
from pyspark.sql.functions import greatest
df.select(greatest("a", "b", "c"))
If you want to keep the name of the max column, you can use structs:
from pyspark.sql.functions import struct, lit
def row_max_with_name(*cols):
    cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
    return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))
maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))
And finally you can use the above to select the "top" column:
from pyspark.sql.functions import max
((_, c), ) = (maxs
    .groupBy(col("maxs")["col"].alias("col"))
    .count()
    .agg(max(struct(col("count"), col("col"))))
    .first())
df.select(c)
We can use greatest
Creating DataFrame
df = spark.createDataFrame(
[[1,2,3], [2,1,2], [3,4,5]],
['col_1','col_2','col_3']
)
df.show()
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| 1| 2| 3|
| 2| 1| 2|
| 3| 4| 5|
+-----+-----+-----+
Solution
from pyspark.sql.functions import greatest
df2 = df.withColumn('max_by_rows', greatest('col_1', 'col_2', 'col_3'))
#Only if you need col
#from pyspark.sql.functions import col
#df2 = df.withColumn('max', greatest(col('col_1'), col('col_2'), col('col_3')))
df2.show()
+-----+-----+-----+-----------+
|col_1|col_2|col_3|max_by_rows|
+-----+-----+-----+-----------+
| 1| 2| 3| 3|
| 2| 1| 2| 2|
| 3| 4| 5| 5|
+-----+-----+-----+-----------+
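If the m columns are only known as a list at runtime, greatest can be applied programmatically by unpacking the list (a sketch, assuming cols holds your column names):

from pyspark.sql.functions import greatest, col

# hypothetical list of the m columns to compare
cols = ['col_1', 'col_2', 'col_3']

df3 = df.withColumn('max_by_rows', greatest(*[col(c) for c in cols]))
df3.show()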
Similarly, for a row-wise minimum you can use the pyspark built-in least:
from pyspark.sql.functions import least, col
df = df.withColumn('min', least(col('c1'), col('c2'), col('c3')))
Another simple way of doing it. Let us say that the below df is your dataframe
df = sc.parallelize([(10, 10, 1 ), (200, 2, 20), (3, 30, 300), (400, 40, 4)]).toDF(["c1", "c2", "c3"])
df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 10| 10| 1|
|200| 2| 20|
| 3| 30|300|
|400| 40| 4|
+---+---+---+
You can process the above df as below to get the desired results
from pyspark.sql.functions import lit, min
df.select(lit('c1').alias('cn1'), min(df.c1).alias('c1'),
          lit('c2').alias('cn2'), min(df.c2).alias('c2'),
          lit('c3').alias('cn3'), min(df.c3).alias('c3')
          )\
  .rdd.flatMap(lambda r: [(r.cn1, r.c1), (r.cn2, r.c2), (r.cn3, r.c3)])\
  .toDF(['Column', 'Min']).show()
+------+---+
|Column|Min|
+------+---+
|    c1|  3|
|    c2|  2|
|    c3|  1|
+------+---+
Scala solution:
val df = sc.parallelize(Seq((10, 10, 1), (200, 2, 20), (3, 30, 300), (400, 40, 4))).toDF("c1", "c2", "c3")
df.rdd.map(row => List[String](row(0).toString, row(1).toString, row(2).toString)).map(x => (x(0), x(1), x(2), x.min)).toDF("c1", "c2", "c3", "min").show
+---+---+---+---+
| c1| c2| c3|min|
+---+---+---+---+
| 10| 10| 1| 1|
|200| 2| 20| 2|
| 3| 30|300| 3|
|400| 40| 4| 4|
+---+---+---+---+
I have a data in a file in the following format:
1,32
1,33
1,44
2,21
2,56
1,23
The code I am executing is the following:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import spark.implicits._
import sqlContext.implicits._
case class Person(a: Int, b: Int)
val ppl = sc.textFile("newfile.txt").map(_.split(","))
  .map(p => Person(p(0).trim.toInt, p(1).trim.toInt))
  .toDF()
ppl.registerTempTable("people")
val result = ppl.select("a","b").groupBy('a).agg()
result.show
Expected Output is:
a 32, 33, 44, 23
b 21, 56
Instead of aggregating by sum, count, mean, etc., I want every element of the group.
Try the collect_set function inside agg():
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  (1, 3), (1, 6), (1, 5), (2, 1), (2, 4),
  (2, 1))).toDF("a", "b")
+---+---+
| a| b|
+---+---+
| 1| 3|
| 1| 6|
| 1| 5|
| 2| 1|
| 2| 4|
| 2| 1|
+---+---+
df.groupBy("a").agg(collect_set("b")).show()
+---+--------------+
| a|collect_set(b)|
+---+--------------+
| 1| [3, 6, 5]|
| 2| [1, 4]|
+---+--------------+
And if you want duplicate entries, you can use collect_list:
df.groupBy("a").agg(collect_list("b")).show()
+---+---------------+
| a|collect_list(b)|
+---+---------------+
| 1| [3, 6, 5]|
| 2| [1, 4, 1]|
+---+---------------+