Pyspark StandardScaler over a Window - python-3.x

I want to apply pyspark.ml.feature.StandardScaler over a window of my data.
df4 = spark.createDataFrame(
    [
        (1, 1, 'X', 'a'),
        (2, 1, 'X', 'a'),
        (3, 9, 'X', 'b'),
        (5, 1, 'X', 'b'),
        (6, 2, 'X', 'c'),
        (7, 2, 'X', 'c'),
        (8, 10, 'Y', 'a'),
        (9, 45, 'Y', 'a'),
        (10, 3, 'Y', 'a'),
        (11, 3, 'Y', 'b'),
        (12, 6, 'Y', 'b'),
        (13, 19, 'Y', 'b')
    ],
    ['id', 'feature', 'txt', 'cat']
)
w = Window().partitionBy(..)
I can do this over the whole dataframe by calling the .fit and .transform methods, but not over the window variable w, which is normally used like F.col('feature') - F.mean('feature').over(w).
I could reshape all my windowed/grouped data into separate columns, build a dataframe from them, apply StandardScaler to that, and then transform back to a single column. Is there another way? The ultimate goal is to try different scalers, including pyspark.ml.feature.RobustScaler.
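To make it concrete, the per-window version I have in mind would look roughly like this sketch (partitioning by 'txt' and the small epsilon are just assumptions for the example):

from pyspark.sql import functions as F
from pyspark.sql import Window

# Sketch only: standardize 'feature' within each 'txt' partition (assumed grouping).
w = Window.partitionBy('txt')
eps = 1e-6  # assumed tolerance to avoid division by zero

df4_scaled = df4.withColumn(
    'feature_scaled',
    (F.col('feature') - F.mean('feature').over(w)) / (F.stddev('feature').over(w) + eps)
)
df4_scaled.show()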

I eventually had to write my own scaler class. The pyspark StandardScaler is not well suited to this problem, since it is designed for end-to-end transformations of a whole column rather than per-group scaling. My scaler does not actually use a pyspark Window; it achieves the same effect with a groupBy.
from functools import reduce
from pyspark.sql import functions as F


class StandardScaler:

    tol = 0.000001

    def __init__(self, colsTotransform, groupbyCol='txt', orderBycol='id'):
        self.colsTotransform = colsTotransform
        self.groupbyCol = groupbyCol
        self.orderBycol = orderBycol

    def __tempNames__(self):
        return [(f"{colname}_transformed", colname) for colname in self.colsTotransform]

    def fit(self, df):
        # Compute per-group mean and (sample) standard deviation for every column to transform.
        funcs = [(F.mean(name), F.stddev(name)) for name in self.colsTotransform]
        exprs = [ff for tup in funcs for ff in tup]
        self.stats = df.groupBy([self.groupbyCol]).agg(*exprs)

    def __transformOne__(self, df_with_stats, newName, colName):
        # Standardize one column using the joined group statistics, then restore its original name.
        return df_with_stats\
            .withColumn(newName,
                        (F.col(colName) - F.col(f'avg({colName})')) / (F.col(f'stddev_samp({colName})') + self.tol))\
            .drop(colName)\
            .withColumnRenamed(newName, colName)

    def transform(self, df):
        df_with_stats = df.join(self.stats, on=self.groupbyCol, how='inner').orderBy(self.orderBycol)
        return reduce(lambda df_with_stats, kv: self.__transformOne__(df_with_stats, *kv),
                      self.__tempNames__(), df_with_stats)[df.columns]
Usage:
ss = StandardScaler(colsTotransform=['feature'], groupbyCol='txt', orderBycol='id')
ss.fit(df4)
ss.stats.show()
+---+------------------+--------------------+
|txt| avg(feature)|stddev_samp(feature)|
+---+------------------+--------------------+
| Y|14.333333333333334| 16.169930941926335|
| X|2.6666666666666665| 3.1411250638372654|
+---+------------------+--------------------+
df4.show()
+---+-------+---+---+
| id|feature|txt|cat|
+---+-------+---+---+
| 1| 1| X| a|
| 2| 1| X| a|
| 3| 9| X| b|
| 5| 1| X| b|
| 6| 2| X| c|
| 7| 2| X| c|
| 8| 10| Y| a|
| 9| 45| Y| a|
| 10| 3| Y| a|
| 11| 3| Y| b|
| 12| 6| Y| b|
| 13| 19| Y| b|
+---+-------+---+---+
ss.transform(df4).show()
+---+--------------------+---+---+
| id| feature|txt|cat|
+---+--------------------+---+---+
| 1| -0.530595281053646| X| a|
| 2| -0.530595281053646| X| a|
| 3| 2.0162620680038548| X| b|
| 5| -0.530595281053646| X| b|
| 6|-0.21223811242145835| X| c|
| 7|-0.21223811242145835| X| c|
| 8| -0.2679871102053074| Y| a|
| 9| 1.8965241645298676| Y| a|
| 10| -0.7008893651523425| Y| a|
| 11| -0.7008893651523425| Y| b|
| 12| -0.5153598273178989| Y| b|
| 13| 0.2886015032980233| Y| b|
+---+--------------------+---+---+
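The same groupBy pattern extends to a robust (median/IQR) variant; here is a sketch, where the percentile_approx statistics and the epsilon are my own assumptions rather than an exact replica of pyspark.ml.feature.RobustScaler:

from pyspark.sql import functions as F

# Sketch only: per-group robust scaling with median and IQR (assumed statistics).
eps = 1e-6
robust_stats = df4.groupBy('txt').agg(
    F.expr('percentile_approx(feature, 0.5)').alias('median_feature'),
    F.expr('percentile_approx(feature, 0.75) - percentile_approx(feature, 0.25)').alias('iqr_feature')
)

df4_robust = (df4.join(robust_stats, on='txt', how='inner')
                 .withColumn('feature_robust',
                             (F.col('feature') - F.col('median_feature')) / (F.col('iqr_feature') + eps))
                 .drop('median_feature', 'iqr_feature'))
df4_robust.show()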

Related

Conditions in Spark window function

I have a dataframe like
+---+---+---+---+
| q| w| e| r|
+---+---+---+---+
| a| 1| 20| y|
| a| 2| 22| z|
| b| 3| 10| y|
| b| 4| 12| y|
+---+---+---+---+
I want to mark the row with the minimum e among rows where r = z. If there are no rows with r = z, I want the row with the minimum e overall, even if r = y.
Essentially, something like
+---+---+---+---+---+
| q| w| e| r| t|
+---+---+---+---+---+
| a| 1| 20| y| 0|
| a| 2| 22| z| 1|
| b| 3| 10| y| 1|
| b| 4| 12| y| 0|
+---+---+---+---+---+
I can do it using a number of joins, but that would be too expensive.
So I was looking for a window-based solution.
You can calculate the minimum per group once for rows with r = z and then for all rows within a group. The first non-null value can then be compared to e:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = ...
w = Window.partitionBy("q")
#When ordering is not defined, an unbounded window frame is used by default.
df.withColumn("min_e_with_r_eq_z", F.expr("min(case when r='z' then e else null end)").over(w)) \
.withColumn("min_e_overall", F.min("e").over(w)) \
.withColumn("t", F.coalesce("min_e_with_r_eq_z","min_e_overall") == F.col("e")) \
.orderBy("w") \
.show()
Output:
+---+---+---+---+-----------------+-------------+-----+
| q| w| e| r|min_e_with_r_eq_z|min_e_overall| t|
+---+---+---+---+-----------------+-------------+-----+
| a| 1| 20| y| 22| 20|false|
| a| 2| 22| z| 22| 20| true|
| b| 3| 10| y| null| 10| true|
| b| 4| 12| y| null| 10|false|
+---+---+---+---+-----------------+-------------+-----+
Note: I assume that q is the grouping column for the window.
You can assign row numbers based on whether r = z and the value of column e:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    't',
    F.when(
        F.row_number().over(
            Window.partitionBy('q')
            .orderBy((F.col('r') == 'z').desc(), 'e')
        ) == 1,
        1
    ).otherwise(0)
)
df2.show()
+---+---+---+---+---+
| q| w| e| r| t|
+---+---+---+---+---+
| a| 2| 22| z| 1|
| a| 1| 20| y| 0|
| b| 3| 10| y| 1|
| b| 4| 12| y| 0|
+---+---+---+---+---+
Adding the Spark Scala version of @werner's accepted answer:
val w = Window.partitionBy("q")
df.withColumn("min_e_with_r_eq_z", min(when($"r" === "z", $"e").otherwise(null)).over(w))
.withColumn("min_e_overall", min("e").over(w))
.withColumn("t", coalesce($"min_e_with_r_eq_z", $"min_e_overall") === $"e")
.orderBy("w")
.show()
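For completeness, the same idea can be sketched in plain Spark SQL from PySpark; the temp view name 'tbl' is an assumption:

# Sketch only: the coalesce-of-window-minimums approach expressed in Spark SQL.
df.createOrReplaceTempView("tbl")
spark.sql("""
    SELECT q, w, e, r,
           COALESCE(MIN(CASE WHEN r = 'z' THEN e END) OVER (PARTITION BY q),
                    MIN(e) OVER (PARTITION BY q)) = e AS t
    FROM tbl
    ORDER BY w
""").show()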

Why sum is not displaying after aggregation & pivot?

I have student marks as shown below; I want to pivot the SubjectName column and also get the total marks after the pivot.
Source table like:
+---------+-----------+-----+
|StudentId|SubjectName|Marks|
+---------+-----------+-----+
| 1| A| 10|
| 1| B| 20|
| 1| C| 30|
| 2| A| 20|
| 2| B| 25|
| 2| C| 30|
| 3| A| 10|
| 3| B| 20|
| 3| C| 20|
+---------+-----------+-----+
Destination:
+---------+---+---+---+-----+
|StudentId| A| B| C|Total|
+---------+---+---+---+-----+
| 1| 10| 20| 30| 60|
| 3| 10| 20| 20| 50|
| 2| 20| 25| 30| 75|
+---------+---+---+---+-----+
Please find the below source code:
val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._
val list = List((1, "A", 10), (1, "B", 20), (1, "C", 30), (2, "A", 20), (2, "B", 25), (2, "C", 30), (3, "A", 10),
(3, "B", 20), (3, "C", 20))
val df = list.toDF("StudentId", "SubjectName", "Marks")
df.show() // source table as per above
val df1 = df.groupBy("StudentId").pivot("SubjectName", Seq("A", "B", "C")).agg(sum("Marks"))
df1.show()
val df2 = df1.withColumn("Total", col("A") + col("B") + col("C"))
df2.show // required destination
val df3 = df.groupBy("StudentId").agg(sum("Marks").as("Total"))
df3.show()
df1 is not displaying the sum/total column; it displays as shown below.
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 10| 20| 30|
| 3| 10| 20| 20|
| 2| 20| 25| 30|
+---------+---+---+---+
df3 is able to create the new Total column, so why is df1 not able to create that column?
Please, can anybody tell me what I am missing, or whether anything is wrong with my understanding of the pivot concept?
This is the expected behaviour of Spark's pivot function: the .agg function is applied per pivoted column, which is why you do not see the sum of marks as a separate new column.
Refer to the official documentation about pivot.
Example:
scala> df.groupBy("StudentId").pivot("SubjectName").agg(sum("Marks") + 2).show()
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 12| 22| 32|
| 3| 12| 22| 22|
| 2| 22| 27| 32|
+---------+---+---+---+
In the above example we have added 2 to all the pivoted columns.
Example2:
To get the count using pivot and agg:
scala> df.groupBy("StudentId").pivot("SubjectName").agg(count("*")).show()
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 1| 1| 1|
| 3| 1| 1| 1|
| 2| 1| 1| 1|
+---------+---+---+---+
The .agg that follows pivot applies only to the pivoted data. To get the total, you should add a new column and sum the pivoted columns, as below.
val cols = Seq("A", "B", "C")
val result = df.groupBy("StudentId")
  .pivot("SubjectName")
  .agg(sum("Marks"))
  .withColumn("Total", cols.map(col _).reduce(_ + _))
result.show(false)
Output:
+---------+---+---+---+-----+
|StudentId|A |B |C |Total|
+---------+---+---+---+-----+
|1 |10 |20 |30 |60 |
|3 |10 |20 |20 |50 |
|2 |20 |25 |30 |75 |
+---------+---+---+---+-----+
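If you prefer staying in PySpark, a sketch of the same idea (assuming a PySpark DataFrame df with the same StudentId, SubjectName and Marks columns) is to pivot and then join the per-student totals back on:

from pyspark.sql import functions as F

# Sketch only: pivot, then join the per-student totals computed separately.
pivoted = df.groupBy("StudentId").pivot("SubjectName", ["A", "B", "C"]).agg(F.sum("Marks"))
totals = df.groupBy("StudentId").agg(F.sum("Marks").alias("Total"))
pivoted.join(totals, on="StudentId", how="inner").show()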

How to check a Boolean condition between columns in a Dataframe

I have the following Dataframe, and I want to check a condition between the columns:
+---+----+------+---------+------+
| ID|Name|Salary|Operation|Points|
+---+----+------+---------+------+
| 1| A| 10000| a AND b| 100|
| 1| A| 10000| a OR b| 200|
| 1| A| 10000|otherwise| 0|
| 2| B| 200| a AND b| 100|
| 2| B| 200| a OR b| 200|
| 2| B| 200|otherwise| 0|
| 3| C| 700| a AND b| 100|
| 3| C| 700| a OR b| 200|
| 3| C| 700|otherwise| 0|
| 4| D| 1000| a AND b| 100|
| 4| D| 1000| a OR b| 200|
| 4| D| 1000|otherwise| 0|
| 5| E| 650| a AND b| 100|
| 5| E| 650| a OR b| 200|
| 5| E| 650|otherwise| 0|
+---+----+------+---------+------+
Where:
a='salary==1000'
b='salary>500'
If the operation evaluates to true, the points are assigned and a new column named Reward is added to the dataframe.
For example, take the first entry: A has a salary of 10000. Checking condition a (salary equal to 1000) gives false while b (salary greater than 500) gives true, so a AND b is false and 0 points are assigned for that operation.
Result:
+---+----+------+------+
| ID|Name|Salary|Reward|
+---+----+------+------+
| 1| A| 10000| 200|
| 2| B| 200| 0|
| 3| C| 700| 200|
| 4| D| 1000| 200|
| 5| E| 650| 200|
+---+----+------+------+
You can piece something together with a filter expression and a groupby:
import pyspark.sql.functions as F
l = [
    (1, 'A', 10000, 'a AND b', 100),
    (1, 'A', 10000, 'a OR b', 200),
    (1, 'A', 10000, 'otherwise', 0),
    (2, 'B', 200, 'a AND b', 100),
    (2, 'B', 200, 'a OR b', 200),
    (2, 'B', 200, 'otherwise', 0),
    (3, 'C', 700, 'a AND b', 100),
    (3, 'C', 700, 'a OR b', 200),
    (3, 'C', 700, 'otherwise', 0),
    (4, 'D', 1000, 'a AND b', 100),
    (4, 'D', 1000, 'a OR b', 200),
    (4, 'D', 1000, 'otherwise', 0),
    (5, 'E', 650, 'a AND b', 100),
    (5, 'E', 650, 'a OR b', 200),
    (5, 'E', 650, 'otherwise', 0)]
columns = ['ID', 'Name', 'Salary', 'Operation', 'Points']
df = spark.createDataFrame(l, columns)

df.filter(
    (df.Operation.contains('AND') & (df.Salary == 1000) & (df.Salary > 500)) |
    (df.Operation.contains('OR') & ((df.Salary == 1000) | (df.Salary > 500))) |
    df.Operation.contains('otherwise')
).groupBy('ID', 'Name', 'Salary').agg(F.max('Points').alias('Rewards')).show()
Output:
+---+----+------+-------+
| ID|Name|Salary|Rewards|
+---+----+------+-------+
| 1| A| 10000| 200|
| 2| B| 200| 0|
| 3| C| 700| 200|
| 5| E| 650| 200|
| 4| D| 1000| 200|
+---+----+------+-------+
Please also have a look at a similar question and the answer of Shan.
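As an alternative sketch (not from the answers above), the two conditions can also be evaluated per row with when expressions on the same df:

from pyspark.sql import functions as F

# Sketch only: evaluate each Operation row directly, then keep the matching reward per person.
a = F.col('Salary') == 1000
b = F.col('Salary') > 500

applies = (F.when(F.col('Operation') == 'a AND b', a & b)
            .when(F.col('Operation') == 'a OR b', a | b)
            .otherwise(F.lit(True)))  # 'otherwise' rows always apply and carry 0 points

(df.withColumn('applies', applies)
   .filter('applies')
   .groupBy('ID', 'Name', 'Salary')
   .agg(F.max('Points').alias('Reward'))
   .show())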

Difference in dense rank and row number in spark

I tried to understand the difference between dense_rank and row_number. Both start from 1 in each new window partition. Doesn't the rank of a row always start from 1? Any help would be appreciated.
The difference is when there are "ties" in the ordering column. Check the example below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df = Seq(("a", 10), ("a", 10), ("a", 20)).toDF("col1", "col2")
val windowSpec = Window.partitionBy("col1").orderBy("col2")
df
  .withColumn("rank", rank().over(windowSpec))
  .withColumn("dense_rank", dense_rank().over(windowSpec))
  .withColumn("row_number", row_number().over(windowSpec)).show
+----+----+----+----------+----------+
|col1|col2|rank|dense_rank|row_number|
+----+----+----+----------+----------+
| a| 10| 1| 1| 1|
| a| 10| 1| 1| 2|
| a| 20| 3| 2| 3|
+----+----+----+----------+----------+
Note that the value "10" exists twice in col2 within the same window (col1 = "a"). That's when you see a difference between the three functions.
I'm showing @Daniel's answer in Python and adding a comparison with count('*'), which can be used if you want to get at most the top-n rows per group.
from pyspark.sql.session import SparkSession
from pyspark.sql import Window
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    ['a', 10], ['a', 20], ['a', 30],
    ['a', 40], ['a', 40], ['a', 40], ['a', 40],
    ['a', 50], ['a', 50], ['a', 60]], ['part_col', 'order_col'])
window = Window.partitionBy("part_col").orderBy("order_col")
df = (df
      .withColumn("rank", F.rank().over(window))
      .withColumn("dense_rank", F.dense_rank().over(window))
      .withColumn("row_number", F.row_number().over(window))
      .withColumn("count", F.count('*').over(window))
      )
df.show()
+--------+---------+----+----------+----------+-----+
|part_col|order_col|rank|dense_rank|row_number|count|
+--------+---------+----+----------+----------+-----+
| a| 10| 1| 1| 1| 1|
| a| 20| 2| 2| 2| 2|
| a| 30| 3| 3| 3| 3|
| a| 40| 4| 4| 4| 7|
| a| 40| 4| 4| 5| 7|
| a| 40| 4| 4| 6| 7|
| a| 40| 4| 4| 7| 7|
| a| 50| 8| 5| 8| 9|
| a| 50| 8| 5| 9| 9|
| a| 60| 10| 6| 10| 10|
+--------+---------+----+----------+----------+-----+
For example, if you want to take at most 4 rows per group without arbitrarily picking some of the four rows that have 40 in the sorting column:
df.where("count <= 4").show()
+--------+---------+----+----------+----------+-----+
|part_col|order_col|rank|dense_rank|row_number|count|
+--------+---------+----+----------+----------+-----+
| a| 10| 1| 1| 1| 1|
| a| 20| 2| 2| 2| 2|
| a| 30| 3| 3| 3| 3|
+--------+---------+----+----------+----------+-----+
In summary, if you filter each of those columns with <= n, you will get:
rank: at least n rows
dense_rank: all rows of the first n different order_col values (at least n rows)
row_number: exactly n rows
count: at most n rows
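For contrast, a quick sketch of the other filters on the same df (n = 4 assumed):

# Sketch only: compare the filters on the df built above.
df.where("row_number <= 4").show()  # exactly 4 rows; ties broken arbitrarily
df.where("rank <= 4").show()        # at least 4 rows; keeps every tied row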

Unpivot in Spark SQL / PySpark

I have a problem statement at hand wherein I want to unpivot a table in Spark SQL / PySpark. I have gone through the documentation and could see there is support only for pivot, but no support for unpivot so far.
Is there a way I can achieve this?
Let my initial table look like this:
When I pivot this in PySpark:
df.groupBy("A").pivot("B").sum("C")
I get this as the output:
Now I want to unpivot the pivoted table. In general, this operation may or may not yield the original table, depending on how the original table was pivoted.
Spark SQL does not currently provide out-of-the-box support for unpivot. Is there a way I can achieve this?
You can use the built-in stack function, for example in Scala:
scala> val df = Seq(("G",Some(4),2,None),("H",None,4,Some(5))).toDF("A","X","Y", "Z")
df: org.apache.spark.sql.DataFrame = [A: string, X: int ... 2 more fields]
scala> df.show
+---+----+---+----+
| A| X| Y| Z|
+---+----+---+----+
| G| 4| 2|null|
| H|null| 4| 5|
+---+----+---+----+
scala> df.select($"A", expr("stack(3, 'X', X, 'Y', Y, 'Z', Z) as (B, C)")).where("C is not null").show
+---+---+---+
| A| B| C|
+---+---+---+
| G| X| 4|
| G| Y| 2|
| H| Y| 4|
| H| Z| 5|
+---+---+---+
Or in pyspark:
In [1]: df = spark.createDataFrame([("G",4,2,None),("H",None,4,5)],list("AXYZ"))
In [2]: df.show()
+---+----+---+----+
| A| X| Y| Z|
+---+----+---+----+
| G| 4| 2|null|
| H|null| 4| 5|
+---+----+---+----+
In [3]: df.selectExpr("A", "stack(3, 'X', X, 'Y', Y, 'Z', Z) as (B, C)").where("C is not null").show()
+---+---+---+
| A| B| C|
+---+---+---+
| G| X| 4|
| G| Y| 2|
| H| Y| 4|
| H| Z| 5|
+---+---+---+
Spark 3.4+
df = df.melt(['A'], ['X', 'Y', 'Z'], 'B', 'C')
# OR
df = df.unpivot(['A'], ['X', 'Y', 'Z'], 'B', 'C')
+---+---+----+
| A| B| C|
+---+---+----+
| G| Y| 2|
| G| Z|null|
| G| X| 4|
| H| Y| 4|
| H| Z| 5|
| H| X|null|
+---+---+----+
To filter out nulls: df = df.filter("C is not null")
Spark 3.3 and below
to_melt = {'X', 'Y', 'Z'}
new_names = ['B', 'C']
melt_str = ','.join([f"'{c}', `{c}`" for c in to_melt])
df = df.select(
*(set(df.columns) - to_melt),
F.expr(f"stack({len(to_melt)}, {melt_str}) ({','.join(new_names)})")
).filter(f"!{new_names[1]} is null")
Full test:
from pyspark.sql import functions as F
df = spark.createDataFrame([("G", 4, 2, None), ("H", None, 4, 5)], list("AXYZ"))
to_melt = {'X', 'Y', 'Z'}
new_names = ['B', 'C']
melt_str = ','.join([f"'{c}', `{c}`" for c in to_melt])
df = df.select(
*(set(df.columns) - to_melt),
F.expr(f"stack({len(to_melt)}, {melt_str}) ({','.join(new_names)})")
).filter(f"!{new_names[1]} is null")
df.show()
# +---+---+---+
# | A| B| C|
# +---+---+---+
# | G| Y| 2|
# | G| X| 4|
# | H| Y| 4|
# | H| Z| 5|
# +---+---+---+
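Since the question also mentions Spark SQL, the same stack approach can be sketched as a plain SQL query over a temp view; the view and subquery aliases are assumptions:

# Sketch only: unpivot with stack() in plain Spark SQL.
df = spark.createDataFrame([("G", 4, 2, None), ("H", None, 4, 5)], list("AXYZ"))
df.createOrReplaceTempView("wide")
spark.sql("""
    SELECT * FROM (
        SELECT A, stack(3, 'X', X, 'Y', Y, 'Z', Z) AS (B, C) FROM wide
    ) t
    WHERE C IS NOT NULL
""").show()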
