pyspark dataframe -- Why are null values recognized differently in the scenarios below?

Why does isNull() behave differently in the scenarios below?
PySpark 1.6
Python 2.6.6
Definition of two dataframes:
df_t1 = sqlContext.sql("select 1 id, 9 num union all select 1 id, 2 num union all select 2 id, 3 num")
df_t2 = sqlContext.sql("select 1 id, 1 start, 3 stop union all select 3 id, 1 start, 9 stop")
Scenario 1:
df_t1.join(df_t2, (df_t1.id == df_t2.id) & (df_t1.num >= df_t2.start) & (df_t1.num <= df_t2.stop), "left").select([df_t2.start, df_t2.start.isNull()]).show()
Output 1:
+-----+-------------+
|start|isnull(start)|
+-----+-------------+
| null|        false|
|    1|        false|
| null|        false|
+-----+-------------+
Scenario 2:
df_new = df_t1.join(df_t2, (df_t1.id == df_t2.id) & (df_t1.num >= df_t2.start) & (df_t1.num <= df_t2.stop), "left")
df_new.select([df_new.start, df_new.start.isNull()]).show()
Output 2:
+-----+-------------+
|start|isnull(start)|
+-----+-------------+
| null|         true|
|    1|        false|
| null|         true|
+-----+-------------+
Scenario 3:
df_t1.join(df_t2, (df_t1.id == df_t2.id) & (df_t1.num >= df_t2.start) & (df_t1.num <= df_t2.stop), "left").filter("start is null").show()
Output 3:
+---+---+----+-----+----+
| id|num|  id|start|stop|
+---+---+----+-----+----+
|  1|  9|null| null|null|
|  2|  3|null| null|null|
+---+---+----+-----+----+
Thank you.
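(Not part of the original post.) A commonly used workaround for this kind of ambiguity is to reference the column on the joined DataFrame itself, or by name, instead of going through the original df_t2, so that isNull() is evaluated against the join result. A minimal sketch, assuming the df_t1/df_t2 definitions above:
from pyspark.sql import functions as F

# Referencing the column by name on the joined DataFrame (as Scenario 2 effectively does)
# evaluates isnull after the left join, so the nulls introduced by the join are seen.
joined = df_t1.join(df_t2, (df_t1.id == df_t2.id) & (df_t1.num >= df_t2.start) & (df_t1.num <= df_t2.stop), "left")
joined.select(F.col("start"), F.isnull(F.col("start"))).show()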

Related

How to Filter Record Grouped by a field in PySpark Dataframe Based on Rank and Values

I have a PySpark DataFrame (Spark 2.2/Python 2.7) which has multiple records for each customer, received on multiple days over a period of time. Here is what a simplified version of the data looks like. The records are ranked in order of the dates (YYYY-MM-DD) on which they were received within each group. The data is guaranteed to have multiple instances of each CUST_ID.
CUST_ID Date_received rank
1 2015-01-01 1
1 2021-01-12 2
1 2021-01-20 3
2 2015-01-01 1
2 2017-12-31 2
2 2021-02-15 3
3 2018-01-01 1
3 2019-07-31 2
4 2015-01-01 1
4 2021-01-01 2
4 2021-01-15 3
I want to split this data into 2 separate dataframes. The first dataframe should only have records fulfilling the criteria below:
CUST_ID was received for the first time (rank 1) on 2015-01-01 and the next time it was received (rank 2) was on or after 2021-01-01. From the data example above, the first dataframe should have only these rows. This should apply to each group of CUST_ID.
CUST_ID Date_received rank
1 2015-01-01 1
1 2021-01-12 2
4 2015-01-01 1
4 2021-01-01 2
And the second dataframe should have the rest:
CUST_ID Date_received rank
1 2021-01-20 3
2 2015-01-01 1
2 2017-12-31 2
2 2021-02-15 3
3 2018-01-01 1
3 2019-07-31 2
4 2021-01-15 3
You can calculate the conditions and broadcast them to every row of each CUST_ID using window functions:
from pyspark.sql import functions as F, Window

df0 = df.withColumn(
    'flag1',
    (F.col('rank') == 1) & (F.col('Date_received') == '2015-01-01')
).withColumn(
    'flag2',
    (F.col('rank') == 2) & (F.col('Date_received') >= '2021-01-01')
).withColumn(
    'grp',
    F.max('flag1').over(Window.partitionBy('CUST_ID')) &
    F.max('flag2').over(Window.partitionBy('CUST_ID'))
)
df0.show()
+-------+-------------+----+-----+-----+-----+
|CUST_ID|Date_received|rank|flag1|flag2|  grp|
+-------+-------------+----+-----+-----+-----+
|      3|   2018-01-01|   1|false|false|false|
|      3|   2019-07-31|   2|false|false|false|
|      1|   2015-01-01|   1| true|false| true|
|      1|   2021-01-12|   2|false| true| true|
|      1|   2021-01-20|   3|false|false| true|
|      4|   2015-01-01|   1| true|false| true|
|      4|   2021-01-01|   2|false| true| true|
|      4|   2021-01-15|   3|false|false| true|
|      2|   2015-01-01|   1| true|false|false|
|      2|   2017-12-31|   2|false|false|false|
|      2|   2021-02-15|   3|false|false|false|
+-------+-------------+----+-----+-----+-----+
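Here F.max over the boolean flag columns acts like a per-group "any": grp is true for a CUST_ID as soon as one of its rows satisfies flag1 and one satisfies flag2. An equivalent way to spell the 'grp' expression (a sketch, casting the flags to integers so the "any within group" intent is explicit):
from pyspark.sql import functions as F, Window

# alternative 'grp' definition producing the same values as above
w = Window.partitionBy('CUST_ID')
df0 = df0.withColumn(
    'grp',
    (F.max(F.col('flag1').cast('int')).over(w) == 1) &
    (F.max(F.col('flag2').cast('int')).over(w) == 1)
)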
Then you can divide the dataframe using the grp column:
df1 = df0.filter('grp and rank <= 2').select(df.columns)
df2 = df0.filter('not (grp and rank <= 2)').select(df.columns)
df1.show()
+-------+-------------+----+
|CUST_ID|Date_received|rank|
+-------+-------------+----+
|      1|   2015-01-01|   1|
|      1|   2021-01-12|   2|
|      4|   2015-01-01|   1|
|      4|   2021-01-01|   2|
+-------+-------------+----+
df2.show()
+-------+-------------+----+
|CUST_ID|Date_received|rank|
+-------+-------------+----+
|      3|   2018-01-01|   1|
|      3|   2019-07-31|   2|
|      1|   2021-01-20|   3|
|      4|   2021-01-15|   3|
|      2|   2015-01-01|   1|
|      2|   2017-12-31|   2|
|      2|   2021-02-15|   3|
+-------+-------------+----+

Loop 3 times and add a new value each time to a new column in spark DF

I want to create 3 rows for every row in a PySpark DF. I want to add a new column called loopVar = (val1, val2, val3). A different value must be added in each loop. Any idea how I do it?
Original:
a b c
1 2 3
1 2 3
Condition 1: loop = 1 and b is not null then loopvar = val1
Condition 2: loop = 2 and b is not null then loopvar = val2
Condition 3: loop = 3 and c is not null then loopvar = val3
Output :
a b c loopvar
1 2 3 val1
1 2 3 val1
1 2 3 val2
1 2 3 val2
1 2 3 val3
1 2 3 val3
Use a crossJoin:
df = spark.createDataFrame([[1,2,3], [1,2,3]]).toDF('a','b','c')
df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  1|  2|  3|
+---+---+---+
df2 = spark.createDataFrame([['val1'], ['val2'], ['val3']]).toDF('loopvar')
df2.show()
+-------+
|loopvar|
+-------+
|   val1|
|   val2|
|   val3|
+-------+
df3 = df.crossJoin(df2)
df3.show()
+---+---+---+-------+
| a| b| c|loopvar|
+---+---+---+-------+
|  1|  2|  3|   val1|
|  1|  2|  3|   val2|
|  1|  2|  3|   val3|
|  1|  2|  3|   val1|
|  1|  2|  3|   val2|
|  1|  2|  3|   val3|
+---+---+---+-------+
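The crossJoin above does not apply the null checks from the question's conditions. If those matter, one possible follow-up (a sketch assuming the column names above; not from the original answer) is to filter the cross-joined result:
from pyspark.sql import functions as F

# keep loopvar val1/val2 only where b is not null, and val3 only where c is not null
df4 = df3.filter(
    (F.col('loopvar').isin('val1', 'val2') & F.col('b').isNotNull()) |
    ((F.col('loopvar') == 'val3') & F.col('c').isNotNull())
)
df4.show()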

combine multiple row in Spark

I wonder if there is an easy way to combine multiple rows into one in PySpark. I am new to Python and Spark and have been using spark.sql most of the time.
Here is a data example:
id count1 count2 count3
1 null 1 null
1 3 null null
1 null null 5
2 null 1 null
2 1 null null
2 null null 2
the expected output is :
id count1 count2 count3
1 3 1 5
2 1 1 2
I have been using Spark SQL to join them multiple times, and wonder if there is an easier way to do that.
Thank you!
Spark SQL's sum ignores nulls (effectively treating them as zero), so if you know there are no "overlapping" data elements, just group by the column you wish to aggregate on and sum.
Assuming that you want to keep your original column names (and not sum the id column), you'll need to specify the columns that are summed and then rename them after the aggregation.
before.show()
+---+------+------+------+
| id|count1|count2|count3|
+---+------+------+------+
|  1|  null|     1|  null|
|  1|     3|  null|  null|
|  1|  null|  null|     5|
|  2|  null|     1|  null|
|  2|     1|  null|  null|
|  2|  null|  null|     2|
+---+------+------+------+
from pyspark.sql.functions import col

after = before.groupby('id') \
    .sum(*[c for c in before.columns if c != 'id']) \
    .select(['id'] + [col(f"sum({c})").alias(c) for c in before.columns if c != 'id'])
after.show()
+---+------+------+------+
| id|count1|count2|count3|
+---+------+------+------+
|  1|     3|     1|     5|
|  2|     1|     1|     2|
+---+------+------+------+
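An equivalent formulation (a sketch under the same assumptions) uses agg with aliases, which keeps the id column and avoids the renaming step:
from pyspark.sql import functions as F

after = before.groupBy('id').agg(
    *[F.sum(c).alias(c) for c in before.columns if c != 'id']
)
after.show()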

Keep track of the previous row values with additional condition using pyspark

I'm using PySpark to generate a dataframe where I need to update the 'amt' column with the previous row's 'amt' value, but only when amt = 0.
For example, below is my dataframe
+---+-----+
| id|  amt|
+---+-----+
|  1|    5|
|  2|    0|
|  3|    0|
|  4|    6|
|  5|    0|
|  6|    3|
+---+-----+
Now, I want the following DF to be created: whenever amt = 0, the modi_amt column should contain the previous row's non-zero value; otherwise, no change.
+---+-----+----------+
| id|  amt|  modi_amt|
+---+-----+----------+
|  1|    5|         5|
|  2|    0|         5|
|  3|    0|         5|
|  4|    6|         6|
|  5|    0|         6|
|  6|    3|         3|
+---+-----+----------+
I'm able to get the previous row's value, but I need help for the rows where multiple 0 amt values appear consecutively (for example, id = 2, 3).
Code I'm using:
from pyspark.sql import functions as F
from pyspark.sql.functions import when
from pyspark.sql.window import Window

my_window = Window.partitionBy().orderBy("id")
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
DF = DF.withColumn("modi_amt", when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt)).drop('prev_amt')
I'm getting the DF below:
+---+-----+----------+
| id|  amt|  modi_amt|
+---+-----+----------+
|  1|    5|         5|
|  2|    0|         5|
|  3|    0|         0|
|  4|    6|         6|
|  5|    0|         6|
|  6|    3|         3|
+---+-----+----------+
Basically, id 3 should also have modi_amt = 5.
I've used the approach below to get the output and it's working fine:
from pyspark.sql import functions as F
from pyspark.sql.functions import when, lit, last
from pyspark.sql.window import Window

my_window = Window.partitionBy().orderBy("id")
# this will hold the previous row's value
DF = DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
# this will replace amt 0 with the previous row's value, but not for consecutive rows having 0 amt
DF = DF.withColumn("amt_adjusted", when(DF.amt == 0, DF.prev_amt).otherwise(DF.amt))
# set null for the rows where both amt and amt_adjusted are 0 (logic for consecutive rows having 0 amt)
DF = DF.withColumn('zeroNonZero', when((DF.amt == 0) & (DF.amt_adjusted == 0), lit(None)).otherwise(DF.amt_adjusted))
# replace all null values with the previous non-zero amt row value
DF = DF.withColumn('modi_amt', last("zeroNonZero", ignorenulls=True).over(Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)))
Is there any other better approach?
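One possible simplification (a sketch assuming the same DF, with id defining the row order; not from the original post) is to null out the zeros and forward-fill in a single pass with last(..., ignorenulls=True):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy('id').rowsBetween(Window.unboundedPreceding, 0)
# treat 0 as missing, then carry forward the last non-null (i.e. non-zero) amt
DF = DF.withColumn(
    'modi_amt',
    F.last(F.when(F.col('amt') != 0, F.col('amt')), ignorenulls=True).over(w)
)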

How to compare two dataframes and add new flag column in pyspark?

I have created two data frames by executing the commands below.
test1 = sc.parallelize([
    ("a", 1, 1),
    ("b", 2, 2),
    ("d", 4, 2),
    ("e", 4, 1),
    ("c", 3, 4)]).toDF(['SID', 'SSection', 'SRank'])
test1.show()
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
|  a|       1|    1|
|  b|       2|    2|
|  d|       4|    2|
|  e|       4|    1|
|  c|       3|    4|
+---+--------+-----+
test2 = sc.parallelize([
    ("a", 1, 1),
    ("b", 2, 3),
    ("f", 4, 2),
    ("e", 4, 1),
    ("c", 3, 4)]).toDF(['SID', 'SSection', 'SRank'])
test2.show()
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
|  a|       1|    1|
|  b|       2|    3|
|  f|       4|    2|
|  e|       4|    1|
|  c|       3|    4|
+---+--------+-----+
Using the test1 and test2 dataframes, I need to produce a new dataframe which should contain the result below.
+---+--------+----------+----------+------------+
|SID|SSection|test1SRank|test2SRank|        flag|
+---+--------+----------+----------+------------+
|  a|       1|         1|         1|   same_rank|
|  b|       2|         2|         3|rank_changed|
|  d|       4|         2|         0|     No_rank|
|  e|       4|         1|         1|   same_rank|
|  c|       3|         4|         4|   same_rank|
|  f|       4|         0|         2|    new_rank|
+---+--------+----------+----------+------------+
I want to produce the above result by comparing test1 and test2 using the combination of the columns SID and SSection, and then comparing the ranks.
for example :
1) SID (a) and SSection (1): the test1 rank is 1 and the test2 rank is 1, so my flag value should be same_rank.
2) SID (b) and SSection (2): the test1 rank is 2 and the test2 rank is 3; the rank was changed, so my flag value should be rank_changed.
3) SID (d) and SSection (4): the test1 rank is 2 and in test2 the rank was lost, so my flag value should be No_rank.
4) SID (f) and SSection (4): there is no rank in test1, and in test2 the rank is 2, so my flag value should be new_rank.
This should give you what you want:
from pyspark.sql import functions as f

test3 = test1.withColumnRenamed('SRank', 'test1SRank') \
    .join(test2.withColumnRenamed('SRank', 'test2SRank'),
          on=['SID', 'SSection'], how='outer') \
    .fillna(0)
test3 = test3.withColumn('flag', f.expr(
    "case when test1SRank = 0 and test2SRank > 0 then 'new_rank' "
    "when test1SRank > 0 and test2SRank = 0 then 'No_rank' "
    "when test1SRank = test2SRank then 'same_rank' "
    "else 'rank_changed' end"))
test3.orderBy('SID').show()
Explanation: Outer join the two data frames on SID and SSection so you have the test1 and test2 ranks for every combination. Then fill the resulting nulls with 0 and derive the flag with a SQL CASE WHEN expression.
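For reference, an equivalent Column-based version of the same flag logic (a sketch using the column names above):
from pyspark.sql import functions as f

test3 = test3.withColumn(
    'flag',
    f.when((f.col('test1SRank') == 0) & (f.col('test2SRank') > 0), 'new_rank')
     .when((f.col('test1SRank') > 0) & (f.col('test2SRank') == 0), 'No_rank')
     .when(f.col('test1SRank') == f.col('test2SRank'), 'same_rank')
     .otherwise('rank_changed')
)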
