Pyspark: reshape data without aggregation - apache-spark

I want to reshape my data from 4x3 to 2x2 in pyspark without aggregating. My current output is the following:
columns = ['FAULTY', 'value_HIGH', 'count']
vals = [
(1, 0, 141),
(0, 0, 140),
(1, 1, 21),
(0, 1, 12)
]
What I want is a contingency table where the second column becomes two new columns (value_HIGH_1, value_HIGH_0) holding the values from the count column - meaning:
columns = ['FAULTY', 'value_HIGH_1', 'value_HIGH_0']
vals = [
(1, 21, 141),
(0, 12, 140)
]

You can use pivot with a fake maximum aggregation (since you have only one element for each group):
import pyspark.sql.functions as F
df.groupBy('FAULTY').pivot('value_HIGH').agg(F.max('count')).selectExpr(
    'FAULTY', '`1` as value_high_1', '`0` as value_high_0'
).show()
+------+------------+------------+
|FAULTY|value_high_1|value_high_0|
+------+------------+------------+
|     0|          12|         140|
|     1|          21|         141|
+------+------------+------------+
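If you already know the distinct values in value_HIGH, you can pass them to pivot explicitly, which spares Spark an extra pass over the data to discover them; a minimal variant of the same query:
import pyspark.sql.functions as F
(df.groupBy('FAULTY')
    .pivot('value_HIGH', [1, 0])  # explicit pivot values skip the discovery job
    .agg(F.max('count'))
    .selectExpr('FAULTY', '`1` as value_high_1', '`0` as value_high_0')
    .show())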

Using groupBy and pivot is the natural way to do this, but if you want to avoid any aggregation you can achieve it with a filter and a join:
import pyspark.sql.functions as f
df.where("value_HIGH = 1").select("FAULTY", f.col("count").alias("value_HIGH_1"))\
    .join(
        df.where("value_HIGH = 0").select("FAULTY", f.col("count").alias("value_HIGH_0")),
        on="FAULTY"
    )\
    .show()
#+------+------------+------------+
#|FAULTY|value_HIGH_1|value_HIGH_0|
#+------+------------+------------+
#|     0|          12|         140|
#|     1|          21|         141|
#+------+------------+------------+

Related

How to randomize different numbers for subgroup of rows pyspark

I have a pyspark dataframe. I need to pick a random value from a list for every row that satisfies a given condition. I did:
df = df.withColumn('rand_col', f.when(f.col('condition_col') == condition, random.choice(my_list)))
but the effect is that it picks a single random value once and assigns it to all matching rows:
How can I randomize separately for each row?
You can:
use rand and floor from pyspark.sql.functions to create a random index column into my_list
create a column that holds my_list as an array literal
index into that array column with the random index column
It would look something like this:
import pyspark.sql.functions as f
my_list = [1, 2, 30]
df = spark.createDataFrame(
    [
        (1, 0),
        (2, 1),
        (3, 1),
        (4, 0),
        (5, 1),
        (6, 1),
        (7, 0),
    ],
    ["id", "condition"]
)
df = df.withColumn('rand_index', f.when(f.col('condition') == 1, f.floor(f.rand() * len(my_list))))\
    .withColumn('my_list', f.array([f.lit(x) for x in my_list]))\
    .withColumn('rand_value', f.when(f.col('condition') == 1, f.col("my_list")[f.col("rand_index")]))
df.show()
+---+---------+----------+----------+----------+
| id|condition|rand_index|   my_list|rand_value|
+---+---------+----------+----------+----------+
|  1|        0|      null|[1, 2, 30]|      null|
|  2|        1|         0|[1, 2, 30]|         1|
|  3|        1|         2|[1, 2, 30]|        30|
|  4|        0|      null|[1, 2, 30]|      null|
|  5|        1|         1|[1, 2, 30]|         2|
|  6|        1|         2|[1, 2, 30]|        30|
|  7|        0|      null|[1, 2, 30]|      null|
+---+---------+----------+----------+----------+

SPARK 3 - Populate value with value from previous rows (lookup)

I am new to SPARK. I have 2 dataframes events and players
events dataframe consists of columns
event_id| player_id| match_id| impact_score
players dataframe consists of columns
player_id| player_name| nationality
I am merging the two datasets by player_id with this query:
df_final = (events
.orderBy("player_id")
.join(players.orderBy("player_id"))
.withColumn("current_team", when([no idea what goes in here]).otherwise(getCurrentTeam(col("player_id"))))
.write.mode("overwrite")
.partitionBy("current_team")
)
The getCurrentTeam function triggers an HTTP call that returns a value (the player's current team).
I have data for over 30 million soccer plays and 97 players. I need help creating the current_team column. Imagine a certain player appearing 130,000 times in the events dataframe. I want to look up values from previous rows: if the player has already appeared, I just grab that value (like an in-memory catalog); if it has not, I call the web service.
Due to its distributed nature, Spark cannot express "if a previous call already populated the value, reuse it; otherwise make a new call" across rows. There are two possible options.
Since you are applying an inner join and the players df has the list of all distinct players, you can add the current_team column to this df before applying the join. If the players df is cached before joining, it is possible that the UDF is invoked only once for each player. See the discussion here for why a UDF can be called multiple times for each record.
You can memoize getCurrentTeam.
Working Example - Prepopulate current_team
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
events_data = [(1, 1, 1, 10), (1, 2, 1, 20, ), (1, 3, 1, 30, ), (2, 3, 1, 30, ), (2, 1, 1, 10), (2, 2, 1, 20, ), ]
players_data = [(1, "Player1", "Nat", ), (2, "Player2", "Nat", ), (3, "Player3", "Nat", ), ]
events = spark.createDataFrame(events_data, ("event_id", "player_id", "match_id", "impact_score", ), ).repartition(3)
players = spark.createDataFrame(players_data, ("player_id", "player_name", "nationality", ), ).repartition(3)
@udf(StringType())
def getCurrentTeam(player_id):
    return f"player_{player_id}_team"
players_with_current_team = players.withColumn("current_team", getCurrentTeam(F.col("player_id"))).cache()
events.join(players_with_current_team, ["player_id"]).show()
Output
+---------+--------+--------+------------+-----------+-----------+-------------+
|player_id|event_id|match_id|impact_score|player_name|nationality| current_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
|        2|       2|       1|          20|    Player2|        Nat|player_2_team|
|        2|       1|       1|          20|    Player2|        Nat|player_2_team|
|        3|       2|       1|          30|    Player3|        Nat|player_3_team|
|        3|       1|       1|          30|    Player3|        Nat|player_3_team|
|        1|       2|       1|          10|    Player1|        Nat|player_1_team|
|        1|       1|       1|          10|    Player1|        Nat|player_1_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
Working Example - Memoization
I have used a Python dict to mimic caching and an accumulator to count the number of mimicked network calls made.
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import time
events_data = [(1, 1, 1, 10), (1, 2, 1, 20, ), (1, 3, 1, 30, ), (2, 3, 1, 30, ), (2, 1, 1, 10), (2, 2, 1, 20, ), ]
players_data = [(1, "Player1", "Nat", ), (2, "Player2", "Nat", ), (3, "Player3", "Nat", ), ]
events = spark.createDataFrame(events_data, ("event_id", "player_id", "match_id", "impact_score", ), ).repartition(3)
players = spark.createDataFrame(players_data, ("player_id", "player_name", "nationality", ), ).repartition(3)
players_events_joined = events.join(players, ["player_id"])
memoized_call_counter = spark.sparkContext.accumulator(0)
def memoize_call():
    cache = {}
    def getCurrentTeam(player_id):
        global memoized_call_counter
        cached_value = cache.get(player_id, None)
        if cached_value is not None:
            return cached_value
        # sleep to mimic a network call
        time.sleep(1)
        # increment the counter every time the value can't be looked up in the cache
        memoized_call_counter.add(1)
        cache[player_id] = f"player_{player_id}_team"
        return cache[player_id]
    return getCurrentTeam
getCurrentTeam_udf = udf(memoize_call(), StringType())
players_events_joined.withColumn("current_team", getCurrentTeam_udf(F.col("player_id"))).show()
Output
+---------+--------+--------+------------+-----------+-----------+-------------+
|player_id|event_id|match_id|impact_score|player_name|nationality| current_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
|        2|       2|       1|          20|    Player2|        Nat|player_2_team|
|        2|       1|       1|          20|    Player2|        Nat|player_2_team|
|        3|       2|       1|          30|    Player3|        Nat|player_3_team|
|        3|       1|       1|          30|    Player3|        Nat|player_3_team|
|        1|       2|       1|          10|    Player1|        Nat|player_1_team|
|        1|       1|       1|          10|    Player1|        Nat|player_1_team|
+---------+--------+--------+------------+-----------+-----------+-------------+
>>> memoized_call_counter.value
3
Since there are 3 unique players in total, the code after time.sleep(1) (the mimicked network call) was executed only three times. The number of calls depends on the number of workers, since the cache is not shared across workers. Because I ran the example in local mode (with 1 worker), the number of calls equals the number of unique players; with more workers, each worker would build up its own cache.

Partitioning by multiple columns in PySpark with columns in a list

My question is similar to this thread:
Partitioning by multiple columns in Spark SQL
but I'm working in Pyspark rather than Scala and I want to pass in my list of columns as a list. I want to do something like this:
column_list = ["col1","col2"]
win_spec = Window.partitionBy(column_list)
I can get the following to work:
win_spec = Window.partitionBy(col("col1"))
This also works:
col_name = "col1"
win_spec = Window.partitionBy(col(col_name))
And this also works:
win_spec = Window.partitionBy([col("col1"), col("col2")])
Convert column names to column expressions with a list comprehension [col(x) for x in column_list]:
from pyspark.sql.functions import col
from pyspark.sql import Window
column_list = ["col1","col2"]
win_spec = Window.partitionBy([col(x) for x in column_list])
In PySpark >= 2.4, this also works:
column_list = ["col1","col2"]
win_spec = Window.partitionBy(*column_list)
Your first attempt should work.
Consider the following example:
import pyspark.sql.functions as f
from pyspark.sql import Window
df = sqlCtx.createDataFrame(
    [
        ("a", "apple", 1),
        ("a", "orange", 2),
        ("a", "orange", 3),
        ("b", "orange", 3),
        ("b", "orange", 5)
    ],
    ["name", "fruit", "value"]
)
df.show()
#+----+------+-----+
#|name| fruit|value|
#+----+------+-----+
#|   a| apple|    1|
#|   a|orange|    2|
#|   a|orange|    3|
#|   b|orange|    3|
#|   b|orange|    5|
#+----+------+-----+
Suppose you wanted to calculate a fraction of the sum for each row, grouping by the first two columns:
cols = ["name", "fruit"]
w = Window.partitionBy(cols)
df.select(cols + [(f.col('value') / f.sum('value').over(w)).alias('fraction')]).show()
#+----+------+--------+
#|name| fruit|fraction|
#+----+------+--------+
#|   a| apple|     1.0|
#|   b|orange|   0.375|
#|   b|orange|   0.625|
#|   a|orange|     0.6|
#|   a|orange|     0.4|
#+----+------+--------+

Why does createDataFrame reorder the columns?

Suppose I am creating a data frame from a list without a schema:
data = [Row(c=0, b=1, a=2), Row(c=10, b=11, a=12)]
df = spark.createDataFrame(data)
df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  2|  1|  0|
| 12| 11| 10|
+---+---+---+
Why are the columns reordered in alphabetical order?
Can I preserve the original order of columns without adding a schema?
Why are the columns reordered in alphabetical order?
Because Row created with **kwargs sorts the arguments by name.
This design choice is required to address the issues described in PEP 468. Please check SPARK-12467 for a discussion.
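A quick illustration of the sorting (this applies to the Spark versions discussed here; Spark 3.x no longer sorts kwargs-based Row fields by default):
from pyspark.sql import Row
r = Row(c=0, b=1, a=2)
r.__fields__
# ['a', 'b', 'c'] on versions that sort keyword arguments
r
# Row(a=2, b=1, c=0)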
Can I preserve the original order of columns without adding a schema?
Not with **kwargs. You can use plain tuples:
df = spark.createDataFrame([(0, 1, 2), (10, 11, 12)], ["c", "b", "a"])
or namedtuple:
from collections import namedtuple
CBA = namedtuple("CBA", ["c", "b", "a"])
spark.createDataFrame([CBA(0, 1, 2), CBA(10, 11, 12)])

Get IDs for duplicate rows (considering all other columns) in Apache Spark

I have a Spark sql dataframe, consisting of an ID column and n "data" columns, i.e.
id | dat1 | dat2 | ... | datn
The id column is unique, whereas, looking at dat1 ... datn, there may be duplicates.
My goal is to find the ids of those duplicates.
My approach so far:
get the duplicate rows using groupBy:
dup_df = df.groupBy(df.columns[1:]).count().filter('count > 1')
join the dup_df with the entire df to get the duplicate rows including id:
df.join(dup_df, df.columns[1:])
I am quite certain that this is basically correct, but it fails because the dat1 ... datn columns contain null values.
To do the join on null values, I found e.g. this SO post, but this would require constructing a huge "string join condition".
Thus my questions:
Is there a simple / more generic / more pythonic way to do joins on null values?
Or, even better, is there another (easier, more beautiful, ...) method to get the desired ids?
BTW: I am using Spark 2.1.0 and Python 3.5.3
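As a side note on question 1, the join condition on nullable columns can be built programmatically from the column list instead of as one big string; a rough, untested sketch reusing df and dup_df from above:
from functools import reduce
import pyspark.sql.functions as f
# null-safe equality for each data column: equal, or both null
data_cols = df.columns[1:]
cond = reduce(
    lambda acc, c: acc & ((df[c] == dup_df[c]) | (df[c].isNull() & dup_df[c].isNull())),
    data_cols,
    f.lit(True)
)
df.join(dup_df, cond)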
If the number of ids per group is relatively small you can groupBy and collect_list. Required imports:
from pyspark.sql.functions import collect_list, size
example data:
df = sc.parallelize([
    (1, "a", "b", 3),
    (2, None, "f", None),
    (3, "g", "h", 4),
    (4, None, "f", None),
    (5, "a", "b", 3)
]).toDF(["id"])
query:
(df
    .groupBy(df.columns[1:])
    .agg(collect_list("id").alias("ids"))
    .where(size("ids") > 1))
and the result:
+----+---+----+------+
|  _2| _3|  _4|   ids|
+----+---+----+------+
|null|  f|null|[2, 4]|
|   a|  b|   3|[1, 5]|
+----+---+----+------+
You can apply explode twice (or use a udf) to get an output equivalent to the one returned from join.
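For instance, a single explode over the collected ids already yields one row per duplicate id; a rough sketch reusing df and the imports above:
from pyspark.sql.functions import explode
dups = (df
    .groupBy(df.columns[1:])
    .agg(collect_list("id").alias("ids"))
    .where(size("ids") > 1))
# one row per duplicate id, roughly the shape the join-based approach would return
dups.select(explode("ids").alias("id"), *df.columns[1:]).show()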
You can also identify groups using the minimal id per group. A few additional imports:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, min
window definition:
w = Window.partitionBy(df.columns[1:])
query:
(df
    .select(
        "*",
        count("*").over(w).alias("_cnt"),
        min("id").over(w).alias("group"))
    .where(col("_cnt") > 1))
and the result:
+---+----+---+----+----+-----+
| id|  _2| _3|  _4|_cnt|group|
+---+----+---+----+----+-----+
|  2|null|  f|null|   2|    2|
|  4|null|  f|null|   2|    2|
|  1|   a|  b|   3|   2|    1|
|  5|   a|  b|   3|   2|    1|
+---+----+---+----+----+-----+
You can further use the group column for a self join.
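A rough sketch of such a self join, pairing up ids that share the same data columns (the l/r aliases are arbitrary):
flagged = (df
    .select(
        "*",
        count("*").over(w).alias("_cnt"),
        min("id").over(w).alias("group"))
    .where(col("_cnt") > 1))
(flagged.alias("l")
    .join(flagged.alias("r"), "group")
    .where(col("l.id") < col("r.id"))
    .select(col("l.id").alias("id"), col("r.id").alias("duplicate_id"))
    .show())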
