How to efficiently mean-normalize columns - apache-spark

I have a large dataset (~4M rows by ~3K columns) and I'm currently mean-normalizing each column using the following code in Python/PySpark:
import pyspark.sql.functions as f
means_pd = df.select(*[f.mean(c).alias(c) for c in df.columns]).toPandas()
diffs = df
for c in df.columns:
    mean = means_pd.loc[0, c]
    diffs = diffs.withColumn(c, f.col(c) - f.lit(mean))
This is quite slow, particularly the loop over the columns. There must be a better way, since functions like MinMaxScaler include a step like this but don't take forever. How can I speed this up?

You can calculate the mean over a window and subtract the result.
df.select([(F.col(c) - F.mean(c).over(W.orderBy())).alias(c) for c in df.columns])
This way you avoid the loop (~3K withColumn calls) and do everything in Spark, without Pandas.
Test:
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [(1, 1, 0),
     (2, 3, 3),
     (3, 5, 6)],
    ['c1', 'c2', 'c3'])
df_diffs = df.select([(F.col(c) - F.mean(c).over(W.orderBy())).alias(c) for c in df.columns])
df_diffs.show()
# +----+----+----+
# | c1| c2| c3|
# +----+----+----+
# |-1.0|-2.0|-3.0|
# | 0.0| 0.0| 0.0|
# | 1.0| 2.0| 3.0|
# +----+----+----+
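Note that a window with no partitionBy moves all the data to a single partition, which can itself become a bottleneck at this scale. A loop-free alternative sketch (assuming df is the original frame from the question) is to collect the means once and subtract them in a single select:
import pyspark.sql.functions as f
# compute all column means in one aggregation and pull them to the driver
means_row = df.select([f.mean(c).alias(c) for c in df.columns]).first()
# subtract every mean in one select instead of ~3K chained withColumn calls
diffs = df.select([(f.col(c) - f.lit(means_row[c])).alias(c) for c in df.columns])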

Related

Dynamically update a Spark dataframe column when used with lag and window functions

I would like to generate a dataframe with an "adstock" column, calculated from the column "col_lag" and an engagement factor of 0.9, as below:
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, lit

# window
windowSpec = Window.partitionBy("id").orderBy("dt")
# create the column if it does not exist
if 'adstock' not in df.columns:
    df = df.withColumn("adstock", lit(0))
df = df.withColumn("adstock", col('col_lag') + lit(0.9) * lag("adstock", 1).over(windowSpec))
When I run the above, the code somehow does not generate values beyond the first two or three rows.
I have around 125,000 ids and weekly data from 2020-01-24 to the current week. I have tried various approaches, such as rowsBetween(Window.unboundedPreceding, 1) or creating another column, but have not been successful.
I would appreciate any suggestions in this regard.
Spark does not perform calculations row by row, so it cannot access the result of the previous row of the calculation currently in progress. To work around this, you can move all the values for the same id into one row and build the calculation logic from there. The higher-order function aggregate lets you emulate a loop, with access to the previously computed value.
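To make the recurrence concrete, here is a plain-Python sketch (illustrative only) of the per-id computation that the aggregate call below emulates:
def adstock_series(col_lag_values, factor=0.9):
    # adstock_t = col_lag_t + factor * adstock_(t-1), starting from 0
    out, prev = [], 0.0
    for v in col_lag_values:
        prev = v + factor * prev
        out.append(prev)
    return out

adstock_series([1, 2, 4, 0])  # [1.0, 2.9, 6.61, 5.949] (up to float rounding)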
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, '2020-10-07', 1),
     (1, '2020-10-14', 2),
     (1, '2020-10-21', 4),
     (1, '2020-10-28', 0),
     (2, '2021-09-08', 1),
     (2, '2021-09-15', 2),
     (2, '2021-09-22', 0),
     (2, '2021-09-29', 0)],
    ['id', 'dt', 'col_lag'])
Script:
df = df.groupBy("id").agg(
F.aggregate(
F.array_sort(F.collect_list(F.struct("dt", "col_lag"))),
F.expr("array(struct(string(null) dt, 0L col_lag, 0D adstock))"),
lambda acc, x: F.array_union(
acc,
F.array(x.withField(
'adstock',
x["col_lag"] + F.lit(0.9) * F.element_at(acc, -1)['adstock']
))
)
).alias("a")
)
df = df.selectExpr("id", "inline(slice(a, 2, size(a)))")
df.show()
# +---+----------+-------+------------------+
# | id| dt|col_lag| adstock|
# +---+----------+-------+------------------+
# | 1|2020-10-07| 1| 1.0|
# | 1|2020-10-14| 2| 2.9|
# | 1|2020-10-21| 4| 6.609999999999999|
# | 1|2020-10-28| 0| 5.949|
# | 2|2021-09-08| 1| 1.0|
# | 2|2021-09-15| 2| 2.9|
# | 2|2021-09-22| 0| 2.61|
# | 2|2021-09-29| 0|2.3489999999999998|
# +---+----------+-------+------------------+
A thorough explanation of the script is provided in this answer.

Pyspark: reshape data without aggregation

I want to reshape my data from 4x3 to 2x2 in pyspark without aggregating. My current output is the following:
columns = ['FAULTY', 'value_HIGH', 'count']
vals = [
    (1, 0, 141),
    (0, 0, 140),
    (1, 1, 21),
    (0, 1, 12)
]
What I want is a contingency table in which the second column is turned into two new columns (value_HIGH_1, value_HIGH_0) holding the values from the count column, i.e.:
columns = ['FAULTY', 'value_HIGH_1', 'value_HIGH_0']
vals = [
    (1, 21, 141),
    (0, 12, 140)
]
You can use pivot with a fake maximum aggregation (since you have only one element for each group):
import pyspark.sql.functions as F
df.groupBy('FAULTY').pivot('value_HIGH').agg(F.max('count')).selectExpr(
'FAULTY', '`1` as value_high_1', '`0` as value_high_0'
).show()
+------+------------+------------+
|FAULTY|value_high_1|value_high_0|
+------+------------+------------+
| 0| 12| 140|
| 1| 21| 141|
+------+------------+------------+
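A small optional tweak (a sketch, not required for correctness): passing the pivot values explicitly spares Spark the extra job it otherwise runs to discover them, and toDF renames the columns in one go:
import pyspark.sql.functions as F
(df.groupBy('FAULTY')
    .pivot('value_HIGH', [1, 0])
    .agg(F.max('count'))
    .toDF('FAULTY', 'value_HIGH_1', 'value_HIGH_0')
    .show())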
Using groupBy and pivot is the natural way to do this, but if you want to avoid any aggregation, you can achieve it with a filter and a join:
import pyspark.sql.functions as f
df.where("value_HIGH = 1").select("FAULTY", f.col("count").alias("value_HIGH_1"))\
.join(
df.where("value_HIGH = 0").select("FAULTY", f.col("count").alias("value_HIGH_1")),
on="FAULTY"
)\
.show()
#+------+------------+------------+
#|FAULTY|value_HIGH_1|value_HIGH_0|
#+------+------------+------------+
#| 0| 12| 140|
#| 1| 21| 141|
#+------+------------+------------+

PySpark replace less frequent items with most frequent items

I have a categorical column in a data frame with a number of levels. I would like to replace the less frequent levels (those whose frequency, as a percentage of the total, is below a specified threshold) with the most frequent level. How can I do that in an elegant and compact way?
Below is an example: if I set the specified frequency threshold to 0.3, then level "c" should be replaced by level "a", since its frequency is only 1/6, which is below 0.3.
from pyspark.sql import Row
row = Row("foo")
df = sc.parallelize([ row("a"), row("b"), row("c"), row("a"), row("a"), row("b") ]).toDF()
from pyspark.sql import Row
import pyspark.sql.functions as f
#sample data
row = Row("foo")
df = sc.parallelize([ row("a"), row("b"), row("c"), row("a"), row("a"), row("b") ]).toDF()
df_temp = df.groupBy('foo').agg((f.count(f.lit(1))/df.count()).alias("frequency"))
most_frequent_foo = df_temp.sort(f.col('frequency').desc()).select('foo').first()[0]
df_temp = df_temp.withColumn(
    'foo_replaced',
    f.when(f.col("frequency") < 0.3, f.lit(most_frequent_foo)).otherwise(f.col('foo')))
df_final = df.join(df_temp, df.foo==df_temp.foo, 'left').drop(df_temp.foo).drop("frequency")
df_final.show()
Output is:
+---+------------+
|foo|foo_replaced|
+---+------------+
| c| a|
| b| b|
| b| b|
| a| a|
| a| a|
| a| a|
+---+------------+
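An alternative sketch that avoids the join entirely (reasonable only when the number of distinct levels is small enough for the frequencies to fit on the driver):
freqs = df.groupBy('foo').agg((f.count(f.lit(1)) / df.count()).alias('frequency')).collect()
most_frequent_foo = max(freqs, key=lambda r: r['frequency'])['foo']
rare_levels = [r['foo'] for r in freqs if r['frequency'] < 0.3]
df_final = df.withColumn(
    'foo_replaced',
    f.when(f.col('foo').isin(rare_levels), f.lit(most_frequent_foo)).otherwise(f.col('foo')))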

Get IDs for duplicate rows (considering all other columns) in Apache Spark

I have a Spark SQL dataframe consisting of an ID column and n "data" columns, i.e.
id | dat1 | dat2 | ... | datn
The id column is unique, whereas, looking at dat1 ... datn, there may be duplicates.
My goal is to find the ids of those duplicates.
My approach so far:
get the duplicate rows using groupBy:
dup_df = df.groupBy(df.columns[1:]).count().filter('count > 1')
join the dup_df with the entire df to get the duplicate rows including id:
df.join(dup_df, df.columns[1:])
I am quite certain that this is basically correct; however, it fails because the dat1 ... datn columns contain null values.
To do the join on null values, I found, e.g., this SO post. But that would require constructing a huge "string join condition".
Thus my questions:
Is there a simple / more generic / more pythonic way to do joins on null values?
Or, even better, is there another (easier, more beautiful, ...) method to get the desired ids?
BTW: I am using Spark 2.1.0 and Python 3.5.3
If the number of ids per group is relatively small, you can groupBy and collect_list. Required imports:
from pyspark.sql.functions import collect_list, size
example data:
df = sc.parallelize([
    (1, "a", "b", 3),
    (2, None, "f", None),
    (3, "g", "h", 4),
    (4, None, "f", None),
    (5, "a", "b", 3)
]).toDF(["id"])
query:
(df
    .groupBy(df.columns[1:])
    .agg(collect_list("id").alias("ids"))
    .where(size("ids") > 1))
and the result:
+----+---+----+------+
| _2| _3| _4| ids|
+----+---+----+------+
|null| f|null|[2, 4]|
| a| b| 3|[1, 5]|
+----+---+----+------+
You can apply explode twice (or use a udf) to get an output equivalent to the one returned from the join.
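For instance, a minimal sketch (a single explode of the ids column already yields one row per duplicate id):
from pyspark.sql.functions import explode

(df
    .groupBy(df.columns[1:])
    .agg(collect_list("id").alias("ids"))
    .where(size("ids") > 1)
    .withColumn("id", explode("ids"))
    .drop("ids")
    .show())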
You can also identify groups using the minimal id per group. A few additional imports:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, min
window definition:
w = Window.partitionBy(df.columns[1:])
query:
(df
    .select(
        "*",
        count("*").over(w).alias("_cnt"),
        min("id").over(w).alias("group"))
    .where(col("_cnt") > 1))
and the result:
+---+----+---+----+----+-----+
| id| _2| _3| _4|_cnt|group|
+---+----+---+----+----+-----+
| 2|null| f|null| 2| 2|
| 4|null| f|null| 2| 2|
| 1| a| b| 3| 2| 1|
| 5| a| b| 3| 2| 1|
+---+----+---+----+----+-----+
You can further use the group column for a self-join, as sketched below.
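A sketch of that self-join (names follow the query above; dups is assumed to be the filtered result):
# dups: rows flagged as duplicates by the window query above
dups = (df
    .select(
        "*",
        count("*").over(w).alias("_cnt"),
        min("id").over(w).alias("group"))
    .where(col("_cnt") > 1))

# pair every duplicate id with the other ids in its group
(dups.select("id", "group").alias("l")
    .join(dups.select("id", "group").alias("r"), on="group")
    .where(col("l.id") < col("r.id"))
    .select(col("l.id").alias("id"), col("r.id").alias("duplicate_id"))
    .show())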

Creating a column based upon a list and column in Pyspark

I have a pyspark DataFrame, say df1, with multiple columns.
I also have a list, say l = ['a','b','c','d'], and these values are a subset of the values present in one of the columns of the DataFrame.
Now, I would like to do something like this:
df2 = df1.withColumn('new_column', expr("case when col_1 in l then 'yes' else 'no' end"))
But this is throwing the following error:
failure: "(" expected but identifier l found.
Any idea how to resolve this error or any better way of doing it?
You can achieve that with the isin function of the Column object:
df1 = sqlContext.createDataFrame([('a', 1), ('b', 2), ('c', 3)], ('col1', 'col2'))
l = ['a', 'b']
from pyspark.sql.functions import *
df2 = df1.withColumn('new_column', when(col('col1').isin(l), 'yes').otherwise('no'))
df2.show()
+----+----+----------+
|col1|col2|new_column|
+----+----+----------+
| a| 1| yes|
| b| 2| yes|
| c| 3| no|
+----+----+----------+
Note: For Spark < 1.5, use inSet instead of isin.
Reference: pyspark.sql.Column documentation
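If you prefer the SQL-expression style from the question, a sketch against the example df1 above (the list has to be rendered into the SQL string yourself, which is why the bare identifier l failed to parse):
# expr comes from the wildcard import of pyspark.sql.functions above
in_list = ", ".join("'{}'".format(x) for x in l)
df2 = df1.withColumn(
    'new_column',
    expr("CASE WHEN col1 IN ({}) THEN 'yes' ELSE 'no' END".format(in_list)))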
