How to check a Boolean condition between columns in a DataFrame - apache-spark

I have the following DataFrame, and I want to evaluate a condition based on its columns:
+---+----+------+---------+------+
| ID|Name|Salary|Operation|Points|
+---+----+------+---------+------+
| 1| A| 10000| a AND b| 100|
| 1| A| 10000| a OR b| 200|
| 1| A| 10000|otherwise| 0|
| 2| B| 200| a AND b| 100|
| 2| B| 200| a OR b| 200|
| 2| B| 200|otherwise| 0|
| 3| C| 700| a AND b| 100|
| 3| C| 700| a OR b| 200|
| 3| C| 700|otherwise| 0|
| 4| D| 1000| a AND b| 100|
| 4| D| 1000| a OR b| 200|
| 4| D| 1000|otherwise| 0|
| 5| E| 650| a AND b| 100|
| 5| E| 650| a OR b| 200|
| 5| E| 650|otherwise| 0|
+---+----+------+---------+------+
Where:
a = 'Salary == 1000'
b = 'Salary > 500'
If the operation evaluates to true, its points are assigned, and a new column named Reward should be added to the dataframe.
For example, the first entry A has a salary of 10000: condition a (salary equal to 1000) is false while condition b (salary greater than 500) is true, so 'a AND b' is false and its 100 points are not assigned, whereas 'a OR b' is true and its 200 points are.
result:
+---+----+------+------+
| ID|Name|Salary|Reward|
+---+----+------+------+
| 1| A| 10000| 200|
| 2| B| 200| 0|
| 3| C| 700| 200|
| 4| D| 1000| 200|
| 5| E| 650| 200|
+---+----+------+------+

You can piece something together with a filter expression and a groupby:
import pyspark.sql.functions as F

l = [(1, 'A', 10000, 'a AND b', 100),
     (1, 'A', 10000, 'a OR b', 200),
     (1, 'A', 10000, 'otherwise', 0),
     (2, 'B', 200, 'a AND b', 100),
     (2, 'B', 200, 'a OR b', 200),
     (2, 'B', 200, 'otherwise', 0),
     (3, 'C', 700, 'a AND b', 100),
     (3, 'C', 700, 'a OR b', 200),
     (3, 'C', 700, 'otherwise', 0),
     (4, 'D', 1000, 'a AND b', 100),
     (4, 'D', 1000, 'a OR b', 200),
     (4, 'D', 1000, 'otherwise', 0),
     (5, 'E', 650, 'a AND b', 100),
     (5, 'E', 650, 'a OR b', 200),
     (5, 'E', 650, 'otherwise', 0)]
columns = ['ID', 'Name', 'Salary', 'Operation', 'Points']
df = spark.createDataFrame(l, columns)

df.filter(
    (df.Operation.contains('AND') & (df.Salary == 1000) & (df.Salary > 500)) |
    (df.Operation.contains('OR') & ((df.Salary == 1000) | (df.Salary > 500))) |
    df.Operation.contains('otherwise')
).groupBy('ID', 'Name', 'Salary').agg(F.max('Points').alias('Rewards')).show()
Output:
+---+----+------+-------+
| ID|Name|Salary|Rewards|
+---+----+------+-------+
| 1| A| 10000| 200|
| 2| B| 200| 0|
| 3| C| 700| 200|
| 5| E| 650| 200|
| 4| D| 1000| 200|
+---+----+------+-------+
Please also have a look at a similar question and the answer of Shan.
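An alternative is to evaluate each row's Operation explicitly with when() and then keep the highest applicable points per person. This is only a sketch reusing the df built above, not taken from the linked answer:
import pyspark.sql.functions as F

a = F.col('Salary') == 1000   # condition a
b = F.col('Salary') > 500     # condition b

(df
 .withColumn('applies',
             F.when(F.col('Operation') == 'a AND b', a & b)
              .when(F.col('Operation') == 'a OR b', a | b)
              .otherwise(F.lit(True)))   # 'otherwise' rows always apply (0 points)
 .filter(F.col('applies'))
 .groupBy('ID', 'Name', 'Salary')
 .agg(F.max('Points').alias('Reward'))
 .show())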

Related

RowNumber with Reset

I am trying to achieve the expected output shown here:
+---+-----+--------+--------+--------+----+
| ID|State| Time|Expected|lagState|rank|
+---+-----+--------+--------+--------+----+
| 1| P|20220722| 1| null| 1|
| 1| P|20220723| 2| P| 2|
| 1| P|20220724| 3| P| 3|
| 1| P|20220725| 4| P| 4|
| 1| D|20220726| 1| P| 1|
| 1| O|20220727| 1| D| 1|
| 1| D|20220728| 1| O| 1|
| 1| P|20220729| 2| D| 1|
| 1| P|20220730| 3| P| 9|
| 1| P|20220731| 4| P| 10|
+---+-----+--------+--------+--------+----+
from pyspark.sql import functions as F
from pyspark.sql.window import Window as w

# create df
df = spark.createDataFrame(sc.parallelize([
[1, 'P', 20220722, 1],
[1, 'P', 20220723, 2],
[1, 'P', 20220724, 3],
[1, 'P', 20220725, 4],
[1, 'D', 20220726, 1],
[1, 'O', 20220727, 1],
[1, 'D', 20220728, 1],
[1, 'P', 20220729, 2],
[1, 'P', 20220730, 3],
[1, 'P', 20220731, 4],
]),
['ID', 'State', 'Time', 'Expected'])
# lag
df = df.withColumn('lagState', F.lag('State').over(w.partitionBy('id').orderBy('time')))
# rn
df = df.withColumn('rank', F.when( F.col('State') == F.col('lagState'), F.rank().over(w.partitionBy('id').orderBy('time', 'state'))).otherwise(1))
# view
df.show()
The general problem is that the tail of the DF is not resetting to the expected value as hoped.
import sys
from pyspark.sql import functions as func, Window as wd

data_sdf = df  # the dataframe created in the question

data_sdf. \
    withColumn('st_notsame',
               func.coalesce(func.col('state') != func.lag('state').over(wd.partitionBy('id').orderBy('time')),
                             func.lit(False)).cast('int')
               ). \
    withColumn('rank_temp',
               func.sum('st_notsame').over(wd.partitionBy('id').orderBy('time').rowsBetween(-sys.maxsize, 0))
               ). \
    withColumn('rank',
               func.row_number().over(wd.partitionBy('id', 'rank_temp').orderBy('time'))
               ). \
    show()
# +---+-----+--------+--------+----------+---------+----+
# | id|state| time|expected|st_notsame|rank_temp|rank|
# +---+-----+--------+--------+----------+---------+----+
# | 1| P|20220722| 1| 0| 0| 1|
# | 1| P|20220723| 2| 0| 0| 2|
# | 1| P|20220724| 3| 0| 0| 3|
# | 1| P|20220725| 4| 0| 0| 4|
# | 1| D|20220726| 1| 1| 1| 1|
# | 1| O|20220727| 1| 1| 2| 1|
# | 1| D|20220728| 1| 1| 3| 1|
# | 1| P|20220729| 2| 1| 4| 1|
# | 1| P|20220730| 3| 0| 4| 2|
# | 1| P|20220731| 4| 0| 4| 3|
# +---+-----+--------+--------+----------+---------+----+
Your expected field looks a little incorrect; I believe the rank against "20220729" should be 1.
The approach works as follows (see the equivalent sketch below):
First, flag every row whose state differs from the previous row as 1 and all consecutive repeats as 0; this enables a running sum.
Then use a sum window with an unbounded lookback per id to get a temporary rank (rank_temp).
Finally, use that temporary rank as an additional partition column for row_number().
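For readers who prefer Window.unboundedPreceding over -sys.maxsize, the same idea can be written as the following sketch (column names taken from the question's dataframe):
from pyspark.sql import functions as F, Window as W

w_time = W.partitionBy('ID').orderBy('Time')

ranked = (
    df
    # 1 whenever the state differs from the previous row, else 0 (first row -> 0)
    .withColumn('changed',
                F.coalesce((F.col('State') != F.lag('State').over(w_time)).cast('int'), F.lit(0)))
    # running sum of the flags = id of each run of identical consecutive states
    .withColumn('grp',
                F.sum('changed').over(w_time.rowsBetween(W.unboundedPreceding, W.currentRow)))
    # restart the counter inside each run
    .withColumn('rank',
                F.row_number().over(W.partitionBy('ID', 'grp').orderBy('Time')))
    .drop('changed', 'grp')
)
ranked.show()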

Filter specific records and earlier using window

I have a dataframe like this:
+----+----+------+
|name|time|statut|
+----+----+------+
|   A|   1|    in|
|   A|   2|   out|
|   A|   3|    in|
|   A|   4|   out|
|   A|   5|    in|
|   B|   1|    in|
|   B|   4|    in|
|   B|   7|   out|
|   B|  18|    in|
+----+----+------+
For each group, I just want to get the last time that statut = "out" and the row after it. Like this:
+----+----+------+
|name|time|statut|
+----+----+------+
|   A|   4|   out|
|   A|   5|    in|
|   B|   7|   out|
|   B|  18|    in|
+----+----+------+
This can be done using a couple of window functions, although the window definitions are not entirely straightforward.
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
[('A', 1, 'in'),
('A', 2, 'out'),
('A', 3, 'in'),
('A', 4, 'out'),
('A', 5, 'in'),
('B', 1, 'in'),
('B', 4, 'in'),
('B', 7, 'out'),
('B', 18, 'in'),
('B', 19, 'in'),
('C', 1, 'in')],
['name', 'time', 'statut']
)
w_max_out = W.partitionBy('name').orderBy(F.col('statut') != 'out', F.desc('time'))
w_lead = W.partitionBy('name').orderBy(F.desc('time'))
df = df.withColumn('_lead', F.lead('time').over(w_lead))
df = df.withColumn('_max_out', F.max(F.when(F.col('statut') == 'out', F.col('time'))).over(w_max_out))
df = df.filter('(_max_out = time) or (_max_out = _lead)').drop('_max_out', '_lead')
Result:
df.show()
# +----+----+------+
# |name|time|statut|
# +----+----+------+
# | A| 4| out|
# | A| 5| in|
# | B| 7| out|
# | B| 18| in|
# +----+----+------+
Result before the last filter line:
# +----+----+------+-----+--------+
# |name|time|statut|_lead|_max_out|
# +----+----+------+-----+--------+
# | A| 4| out| 3| 4|
# | A| 2| out| 1| 4|
# | A| 5| in| 4| 4|
# | A| 3| in| 2| 4|
# | A| 1| in| null| 4|
# | B| 7| out| 4| 7|
# | B| 19| in| 18| 7|
# | B| 18| in| 7| 7|
# | B| 4| in| 1| 7|
# | B| 1| in| null| 7|
# | C| 1| in| null| null|
# +----+----+------+-----+--------+
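For comparison, here is an alternative sketch that starts again from the original df: compute the last 'out' time per name, then keep that row together with the row immediately after it.
from pyspark.sql import functions as F, Window as W

w_name = W.partitionBy('name')
w_time = W.partitionBy('name').orderBy('time')

result = (
    df
    # latest time with statut = 'out' in each group (null if the group has none)
    .withColumn('_last_out',
                F.max(F.when(F.col('statut') == 'out', F.col('time'))).over(w_name))
    # time of the previous row, to detect the row right after the last 'out'
    .withColumn('_prev_time', F.lag('time').over(w_time))
    .filter((F.col('time') == F.col('_last_out')) | (F.col('_prev_time') == F.col('_last_out')))
    .drop('_last_out', '_prev_time')
)
result.show()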

Pyspark StandardScaler over a Window

I want to use the StandardScaler (pyspark.ml.feature.StandardScaler) over a window of my data.
df4=spark.createDataFrame(
[
(1,1, 'X', 'a'),
(2,1, 'X', 'a'),
(3,9, 'X', 'b'),
(5,1, 'X', 'b'),
(6,2, 'X', 'c'),
(7,2, 'X', 'c'),
(8,10, 'Y', 'a'),
(9,45, 'Y', 'a'),
(10,3, 'Y', 'a'),
(11,3, 'Y', 'b'),
(12,6, 'Y', 'b'),
(13,19,'Y', 'b')
],
['id','feature', 'txt', 'cat']
)
w = Window().partitionBy(..)
I can do this over the whole dataframe by calling the .fit and .transform methods, but not over the window w, which is normally used like F.col('feature') - F.mean('feature').over(w).
I could transform all my windowed/grouped data into separate columns, put it into a dataframe, apply StandardScaler over it, and transform back to 1D. Is there any other method? The ultimate goal is to try different scalers, including pyspark.ml.feature.RobustScaler.
I eventually had to write my own scaler class. The pyspark StandardScaler is not well suited to this problem, since it is designed for end-to-end transformations of a whole column. Nonetheless I came up with my own scaler. It does not use a pyspark Window, but achieves the same per-group behaviour with a groupBy.
from functools import reduce
import pyspark.sql.functions as F

class StandardScaler:

    tol = 0.000001

    def __init__(self, colsTotransform, groupbyCol='txt', orderBycol='id'):
        self.colsTotransform = colsTotransform
        self.groupbyCol = groupbyCol
        self.orderBycol = orderBycol

    def __tempNames__(self):
        return [(f"{colname}_transformed", colname) for colname in self.colsTotransform]

    def fit(self, df):
        funcs = [(F.mean(name), F.stddev(name)) for name in self.colsTotransform]
        exprs = [ff for tup in funcs for ff in tup]
        self.stats = df.groupBy([self.groupbyCol]).agg(*exprs)

    def __transformOne__(self, df_with_stats, newName, colName):
        return df_with_stats\
            .withColumn(newName,
                        (F.col(colName) - F.col(f'avg({colName})')) / (F.col(f'stddev_samp({colName})') + self.tol))\
            .drop(colName)\
            .withColumnRenamed(newName, colName)

    def transform(self, df):
        df_with_stats = df.join(self.stats, on=self.groupbyCol, how='inner').orderBy(self.orderBycol)
        return reduce(lambda df_with_stats, kv: self.__transformOne__(df_with_stats, *kv),
                      self.__tempNames__(), df_with_stats)[df.columns]
Usage:
ss = StandardScaler(colsTotransform=['feature'], groupbyCol='txt', orderBycol='id')
ss.fit(df4)
ss.stats.show()
+---+------------------+--------------------+
|txt| avg(feature)|stddev_samp(feature)|
+---+------------------+--------------------+
| Y|14.333333333333334| 16.169930941926335|
| X|2.6666666666666665| 3.1411250638372654|
+---+------------------+--------------------+
df4.show()
+---+-------+---+---+
| id|feature|txt|cat|
+---+-------+---+---+
| 1| 1| X| a|
| 2| 1| X| a|
| 3| 9| X| b|
| 5| 1| X| b|
| 6| 2| X| c|
| 7| 2| X| c|
| 8| 10| Y| a|
| 9| 45| Y| a|
| 10| 3| Y| a|
| 11| 3| Y| b|
| 12| 6| Y| b|
| 13| 19| Y| b|
+---+-------+---+---+
ss.transform(df4).show()
+---+--------------------+---+---+
| id| feature|txt|cat|
+---+--------------------+---+---+
| 1| -0.530595281053646| X| a|
| 2| -0.530595281053646| X| a|
| 3| 2.0162620680038548| X| b|
| 5| -0.530595281053646| X| b|
| 6|-0.21223811242145835| X| c|
| 7|-0.21223811242145835| X| c|
| 8| -0.2679871102053074| Y| a|
| 9| 1.8965241645298676| Y| a|
| 10| -0.7008893651523425| Y| a|
| 11| -0.7008893651523425| Y| b|
| 12| -0.5153598273178989| Y| b|
| 13| 0.2886015032980233| Y| b|
+---+--------------------+---+---+
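If the join through the stats dataframe is not wanted, the same per-group standardization can also be expressed directly with window aggregates instead of a custom class. A minimal sketch, partitioning by txt as above:
from pyspark.sql import functions as F, Window as W

w = W.partitionBy('txt')
tol = 0.000001  # same tolerance as the class above, to avoid division by zero

scaled = df4.withColumn(
    'feature',
    (F.col('feature') - F.mean('feature').over(w)) /
    (F.stddev('feature').over(w) + tol)   # stddev is the sample standard deviation
)
scaled.show()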

Why is the sum not displayed after aggregation & pivot?

Here I have student marks like below, and I want to pivot the subject name column and also get the total marks after the pivot.
Source table like:
+---------+-----------+-----+
|StudentId|SubjectName|Marks|
+---------+-----------+-----+
| 1| A| 10|
| 1| B| 20|
| 1| C| 30|
| 2| A| 20|
| 2| B| 25|
| 2| C| 30|
| 3| A| 10|
| 3| B| 20|
| 3| C| 20|
+---------+-----------+-----+
Destination:
+---------+---+---+---+-----+
|StudentId| A| B| C|Total|
+---------+---+---+---+-----+
| 1| 10| 20| 30| 60|
| 3| 10| 20| 20| 50|
| 2| 20| 25| 30| 75|
+---------+---+---+---+-----+
Please find the below source code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._
val list = List((1, "A", 10), (1, "B", 20), (1, "C", 30), (2, "A", 20), (2, "B", 25), (2, "C", 30), (3, "A", 10),
(3, "B", 20), (3, "C", 20))
val df = list.toDF("StudentId", "SubjectName", "Marks")
df.show() // source table as per above
val df1 = df.groupBy("StudentId").pivot("SubjectName", Seq("A", "B", "C")).agg(sum("Marks"))
df1.show()
val df2 = df1.withColumn("Total", col("A") + col("B") + col("C"))
df2.show // required destination
val df3 = df.groupBy("StudentId").agg(sum("Marks").as("Total"))
df3.show()
df1 is not displaying the sum/total column; it displays like below.
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 10| 20| 30|
| 3| 10| 20| 20|
| 2| 20| 25| 30|
+---------+---+---+---+
df3 is able to create the new Total column, so why is df1 not able to create it?
Can anybody help me with what I am missing, or with what is wrong in my understanding of the pivot concept?
This is expected behaviour of Spark's pivot: the .agg function is applied to each pivoted column, which is why you do not see the sum of marks as a new column.
Refer to the official documentation about pivot.
Example:
scala> df.groupBy("StudentId").pivot("SubjectName").agg(sum("Marks") + 2).show()
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 12| 22| 32|
| 3| 12| 22| 22|
| 2| 22| 27| 32|
+---------+---+---+---+
In the above example we have added 2 to all the pivoted columns.
Example 2:
To get a count using pivot and agg:
scala> df.groupBy("StudentId").pivot("SubjectName").agg(count("*")).show()
+---------+---+---+---+
|StudentId| A| B| C|
+---------+---+---+---+
| 1| 1| 1| 1|
| 3| 1| 1| 1|
| 2| 1| 1| 1|
+---------+---+---+---+
The .agg after pivot applies only to the pivoted data. To get the total, you should add a new column and sum the pivoted columns, as below.
val cols = Seq("A", "B", "C")
val result = df.groupBy("StudentId")
.pivot("SubjectName")
.agg(sum("Marks"))
.withColumn("Total", cols.map(col _).reduce(_ + _))
result.show(false)
Output:
+---------+---+---+---+-----+
|StudentId|A |B |C |Total|
+---------+---+---+---+-----+
|1 |10 |20 |30 |60 |
|3 |10 |20 |20 |50 |
|2 |20 |25 |30 |75 |
+---------+---+---+---+-----+
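For readers working in PySpark rather than Scala, an equivalent sketch of the same pivot-then-total approach, assuming a dataframe df with the same columns:
import pyspark.sql.functions as F

result = (
    df.groupBy('StudentId')
      .pivot('SubjectName', ['A', 'B', 'C'])
      .agg(F.sum('Marks'))
      # the agg only fills the pivoted columns; Total has to be added afterwards
      .withColumn('Total', F.col('A') + F.col('B') + F.col('C'))
)
result.show()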

Pyspark apply function to column value if condition is met [duplicate]

This question already has answers here:
Spark Equivalent of IF Then ELSE
(4 answers)
Closed 3 years ago.
Given a pyspark dataframe, for example:
import pandas as pd

ls = [
['1', 2],
['2', 7],
['1', 3],
['2',-6],
['1', 3],
['1', 5],
['1', 4],
['2', 7]
]
df = spark.createDataFrame(pd.DataFrame(ls, columns=['col1', 'col2']))
df.show()
+----+-----+
|col1| col2|
+----+-----+
| 1| 2|
| 2| 7|
| 1| 3|
| 2| -6|
| 1| 3|
| 1| 5|
| 1| 4|
| 2| 7|
+----+-----+
How can I apply a function to the col2 values where col1 == '1', and store the result in a new column?
For example the function is:
f = x**2
Result should look like this:
+----+-----+-----+
|col1| col2| y|
+----+-----+-----+
| 1| 2| 4|
| 2| 7| null|
| 1| 3| 9|
| 2| -6| null|
| 1| 3| 9|
| 1| 5| 25|
| 1| 4| 16|
| 2| 7| null|
+----+-----+-----+
I tried defining a separate function and using df.withColumn(y).when(condition, function), but it wouldn't work.
So what is a way to do this?
I hope this helps:
from pyspark.sql.functions import when
from pyspark.sql.types import IntegerType

def myFun(x):
    return (x**2).cast(IntegerType())

df2 = df.withColumn("y", when(df.col1 == 1, myFun(df.col2)).otherwise(None))
df2.show()
+----+----+----+
|col1|col2| y|
+----+----+----+
| 1| 2| 4|
| 2| 7|null|
| 1| 3| 9|
| 2| -6|null|
| 1| 3| 9|
| 1| 5| 25|
| 1| 4| 16|
| 2| 7|null|
+----+----+----+
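The same result can also be obtained without a helper function by using pow() directly inside when(); a minimal sketch against the df above:
from pyspark.sql import functions as F

df2 = df.withColumn(
    'y',
    # rows where col1 != '1' get null because no otherwise() is given
    F.when(F.col('col1') == '1', F.pow(F.col('col2'), 2).cast('int'))
)
df2.show()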
