pyspark sqlfunction expr function not working as expected? - apache-spark

The PySpark SQL function expr is not giving me the result I expect.
My test1.txt contains:
101|10|4
101|12|1
101|13|3
101|14|2
My test2.txt contains:
101|10|4
101|11|1
101|13|3
101|14|2
I created two DataFrames from this data and joined them with the code below.
df3 = spark.createDataFrame(sc.textFile("C://Users//cravi//Desktop//test1.txt").map( lambda x: x.split("|")[:3]),["cid","pid","pr"])
df4 = spark.createDataFrame(sc.textFile("C://Users//cravi//Desktop//test2.txt").map( lambda x: x.split("|")[:3]),["cid","pid","p"])
df5=df4.withColumnRenamed("p", "p")\
.join(df3.withColumnRenamed("pr", "Pr")\
, ["cid", "pid"], "outer")\
.na.fill(0)
tt=df5.withColumn('flag', sf.expr("case when p>0 and pr=='null' then 'N'\
when p=0 and Pr>0 then 'D'\
when p=Pr then 'R'\
else 'U' end"))
tt.show()
I am getting the following output:
+---+---+----+----+----+
|cid|pid|   p|  Pr|flag|
+---+---+----+----+----+
|101| 14|   2|   2|   R|
|101| 10|   4|   4|   R|
|101| 11|   1|null|   U|
|101| 12|null|   1|   U|
|101| 13|   3|   3|   R|
+---+---+----+----+----+
The rules I want are:
if p and Pr are the same, the flag should be 'R';
if p has a value and Pr is null, the flag should be 'N';
if p is null and Pr has a value, the flag should be 'D';
in every other case the flag should be 'U'.
In this case the expected output is:
+---+---+----+----+----+
|cid|pid|   p|  Pr|flag|
+---+---+----+----+----+
|101| 14|   2|   2|   R|
|101| 10|   4|   4|   R|
|101| 11|   1|null|   N|
|101| 12|null|   1|   D|
|101| 13|   3|   3|   R|
+---+---+----+----+----+

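In the original expression, pr=='null' compares pr with the string 'null', and any comparison against a real null evaluates to null rather than true, so those rows fall through to the else branch. The built-in isNull and isNotNull functions test for null directly and can be used in the query as follows: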
tt=df5.withColumn('flag', sf.expr("case when isNotNull(`p`) and isNull(`pr`) then 'N'\
when isNull(`p`) and isNotNull(`Pr`) then 'D'\
when p=Pr then 'R'\
else 'U' end"))
You should then get:
+---+---+----+----+----+
|cid|pid|   p|  Pr|flag|
+---+---+----+----+----+
|101| 14|   2|   2|   R|
|101| 10|   4|   4|   R|
|101| 11|   1|null|   N|
|101| 12|null|   1|   D|
|101| 13|   3|   3|   R|
+---+---+----+----+----+
Note: na.fill(0) has no effect here, because the columns are StringType() and a numeric fill value is only applied to numeric columns.
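To see why the original CASE expression lands on 'U', the same null logic can be reproduced without a Spark cluster. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Spark SQL (an assumption made for illustration, not something from the question), and it stores p and pr as integers rather than strings. The table name df5 mirrors the question; everything else is hypothetical.

```python
import sqlite3

# Stand-in for the outer-join result from the question (None marks a missing match).
rows = [(101, 14, 2, 2), (101, 10, 4, 4), (101, 11, 1, None),
        (101, 12, None, 1), (101, 13, 3, 3)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df5 (cid INTEGER, pid INTEGER, p INTEGER, pr INTEGER)")
conn.executemany("INSERT INTO df5 VALUES (?, ?, ?, ?)", rows)

# Original logic: null is compared with '=' and with the string 'null'.
broken = """
    SELECT pid, CASE WHEN p > 0 AND pr = 'null' THEN 'N'
                     WHEN p = 0 AND pr > 0 THEN 'D'
                     WHEN p = pr THEN 'R'
                     ELSE 'U' END
    FROM df5 ORDER BY pid"""

# Fixed logic: explicit null tests, matching the isNull/isNotNull answer.
fixed = """
    SELECT pid, CASE WHEN p IS NOT NULL AND pr IS NULL THEN 'N'
                     WHEN p IS NULL AND pr IS NOT NULL THEN 'D'
                     WHEN p = pr THEN 'R'
                     ELSE 'U' END
    FROM df5 ORDER BY pid"""

broken_flags = dict(conn.execute(broken).fetchall())
fixed_flags = dict(conn.execute(fixed).fetchall())
print(broken_flags)  # pid 11 and 12 both end up as 'U'
print(fixed_flags)   # pid 11 becomes 'N', pid 12 becomes 'D'
```

In both engines a comparison with null is neither true nor false, so only an explicit null test can select those rows.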
I hope the answer is helpful

Related

window function on a subset of data

I have a table like the below. I want to calculate an average of median but only for Q=2 and Q=3. I don't want to include other Qs but still preserve the data.
df = spark.createDataFrame([('2018-03-31',6,1),('2018-03-31',27,2),('2018-03-31',3,3),('2018-03-31',44,4),('2018-06-30',6,1),('2018-06-30',4,3),('2018-06-30',32,2),('2018-06-30',112,4),('2018-09-30',2,1),('2018-09-30',23,4),('2018-09-30',37,3),('2018-09-30',3,2)],['date','median','Q'])
+----------+--------+---+
| date| median | Q |
+----------+--------+---+
|2018-03-31| 6| 1|
|2018-03-31| 27| 2|
|2018-03-31| 3| 3|
|2018-03-31| 44| 4|
|2018-06-30| 6| 1|
|2018-06-30| 4| 3|
|2018-06-30| 32| 2|
|2018-06-30| 112| 4|
|2018-09-30| 2| 1|
|2018-09-30| 23| 4|
|2018-09-30| 37| 3|
|2018-09-30| 3| 2|
+----------+--------+---+
Expected output:
+----------+--------+---+------------+
| date| median | Q |result |
+----------+--------+---+------------+
|2018-03-31| 6| 1| null|
|2018-03-31| 27| 2| 15|
|2018-03-31| 3| 3| 15|
|2018-03-31| 44| 4| null|
|2018-06-30| 6| 1| null|
|2018-06-30| 4| 3| 18|
|2018-06-30| 32| 2| 18|
|2018-06-30| 112| 4| null|
|2018-09-30| 2| 1| null|
|2018-09-30| 23| 4| null|
|2018-09-30| 37| 3| 20|
|2018-09-30| 3| 2| 20|
+----------+--------+---+------------+
OR
+----------+--------+---+------------+
| date| median | Q |result |
+----------+--------+---+------------+
|2018-03-31| 6| 1| 15|
|2018-03-31| 27| 2| 15|
|2018-03-31| 3| 3| 15|
|2018-03-31| 44| 4| 15|
|2018-06-30| 6| 1| 18|
|2018-06-30| 4| 3| 18|
|2018-06-30| 32| 2| 18|
|2018-06-30| 112| 4| 18|
|2018-09-30| 2| 1| 20|
|2018-09-30| 23| 4| 20|
|2018-09-30| 37| 3| 20|
|2018-09-30| 3| 2| 20|
+----------+--------+---+------------+
I tried the following code but when I include the where statement it drops Q=1 and Q=4.
window = (
Window
.partitionBy("date")
.orderBy("date")
)
df_avg = (
df
.where(
(F.col("Q") == 2) |
(F.col("Q") == 3)
)
.withColumn("result", F.avg("median").over(window))
)
For both of your expected outputs you can use conditional aggregation: combine avg with when (otherwise).
For the 1st expected output:
window = (
Window
.partitionBy("date", F.col("Q").isin([2, 3]))
)
df_avg = (
df.withColumn("result", F.when(F.col("Q").isin([2, 3]), F.avg("median").over(window)))
)
For the 2nd expected output:
window = (
Window
.partitionBy("date")
)
df_avg = (
df.withColumn("result", F.avg(F.when(F.col("Q").isin([2, 3]), F.col("median"))).over(window))
)
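Alternatively, since you are really aggregating a (small?) subset, you can replace the window with a self-join. Group by date only (grouping by Q as well would just return each row's own median), then join the averages back on date, which gives the 2nd expected output:
>>> from pyspark.sql.functions import col, avg
>>> df_avg = df.where(col("Q").isin([2,3])).groupBy("date").agg(avg("median").alias("result"))
>>> df_result = df.join(df_avg,["date"],"left")
This might turn out to be faster than using a window.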
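The conditional aggregation idea can be checked against the question's data without Spark. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in (my assumption, and it needs SQLite 3.25 or newer for window functions), and the column names result_all and result_subset are made up for the example.

```python
import sqlite3

data = [('2018-03-31', 6, 1), ('2018-03-31', 27, 2), ('2018-03-31', 3, 3),
        ('2018-03-31', 44, 4), ('2018-06-30', 6, 1), ('2018-06-30', 4, 3),
        ('2018-06-30', 32, 2), ('2018-06-30', 112, 4), ('2018-09-30', 2, 1),
        ('2018-09-30', 23, 4), ('2018-09-30', 37, 3), ('2018-09-30', 3, 2)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (date TEXT, median INTEGER, Q INTEGER)")
conn.executemany("INSERT INTO df VALUES (?, ?, ?)", data)

query = """
    SELECT date, median, Q,
           -- average per date, ignoring rows outside Q in (2, 3): 2nd expected output
           AVG(CASE WHEN Q IN (2, 3) THEN median END)
               OVER (PARTITION BY date) AS result_all,
           -- same average, kept only on the rows that contributed: 1st expected output
           CASE WHEN Q IN (2, 3)
                THEN AVG(CASE WHEN Q IN (2, 3) THEN median END)
                         OVER (PARTITION BY date)
           END AS result_subset
    FROM df ORDER BY date, Q"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

No row is filtered out, which is the difference from the where-based attempt in the question: the condition lives inside the aggregate instead of in front of it.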

Conditions in Spark window function

I have a dataframe like
+---+---+---+---+
| q| w| e| r|
+---+---+---+---+
| a| 1| 20| y|
| a| 2| 22| z|
| b| 3| 10| y|
| b| 4| 12| y|
+---+---+---+---+
I want to mark the row with the minimum e among those with r = z. If there are no rows with r = z, I want the row with the minimum e, even if r = y.
Essentially, something like
+---+---+---+---+---+
| q| w| e| r| t|
+---+---+---+---+---+
| a| 1| 20| y| 0|
| a| 2| 22| z| 1|
| b| 3| 10| y| 1|
| b| 4| 12| y| 0|
+---+---+---+---+---+
I can do it using a number of joins, but that would be too expensive.
So I was looking for a window-based solution.
You can calculate the minimum per group once for rows with r = z and then for all rows within a group. The first non-null value can then be compared to e:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = ...
w = Window.partitionBy("q")
#When ordering is not defined, an unbounded window frame is used by default.
df.withColumn("min_e_with_r_eq_z", F.expr("min(case when r='z' then e else null end)").over(w)) \
.withColumn("min_e_overall", F.min("e").over(w)) \
.withColumn("t", F.coalesce("min_e_with_r_eq_z","min_e_overall") == F.col("e")) \
.orderBy("w") \
.show()
Output:
+---+---+---+---+-----------------+-------------+-----+
| q| w| e| r|min_e_with_r_eq_z|min_e_overall| t|
+---+---+---+---+-----------------+-------------+-----+
| a| 1| 20| y| 22| 20|false|
| a| 2| 22| z| 22| 20| true|
| b| 3| 10| y| null| 10| true|
| b| 4| 12| y| null| 10|false|
+---+---+---+---+-----------------+-------------+-----+
Note: I assume that q is the grouping column for the window.
You can assign row numbers based on whether r = z and the value of column e:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
't',
F.when(
F.row_number().over(
Window.partitionBy('q')
.orderBy((F.col('r') == 'z').desc(), 'e')
) == 1,
1
).otherwise(0)
)
df2.show()
+---+---+---+---+---+
| q| w| e| r| t|
+---+---+---+---+---+
| a| 2| 22| z| 1|
| a| 1| 20| y| 0|
| b| 3| 10| y| 1|
| b| 4| 12| y| 0|
+---+---+---+---+---+
Adding the Spark Scala version of @werner's accepted answer:
val w = Window.partitionBy("q")
df.withColumn("min_e_with_r_eq_z", min(when($"r" === "z", $"e").otherwise(null)).over(w))
.withColumn("min_e_overall", min("e").over(w))
.withColumn("t", coalesce($"min_e_with_r_eq_z", $"min_e_overall") === $"e")
.orderBy("w")
.show()
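The two window approaches above (minimum with a fallback, and row_number with a custom ordering) can be compared side by side on the question's four rows. The sketch below is not Spark code: it uses Python's built-in sqlite3 module as a stand-in (an assumption, requiring SQLite 3.25 or newer), and the column names t_min and t_rank are hypothetical.

```python
import sqlite3

rows = [('a', 1, 20, 'y'), ('a', 2, 22, 'z'), ('b', 3, 10, 'y'), ('b', 4, 12, 'y')]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (q TEXT, w INTEGER, e INTEGER, r TEXT)")
conn.executemany("INSERT INTO df VALUES (?, ?, ?, ?)", rows)

query = """
    SELECT w,
           -- approach 1: min of e among r = 'z' rows, falling back to the overall min
           COALESCE(MIN(CASE WHEN r = 'z' THEN e END) OVER (PARTITION BY q),
                    MIN(e) OVER (PARTITION BY q)) = e AS t_min,
           -- approach 2: order r = 'z' rows first, then by e, and keep the first row
           ROW_NUMBER() OVER (PARTITION BY q ORDER BY (r = 'z') DESC, e) = 1 AS t_rank
    FROM df ORDER BY w"""
marks = conn.execute(query).fetchall()
print(marks)  # (w, t_min, t_rank) per row, with 1 meaning "marked"
```

On this data both columns agree. They can differ when several rows in a group tie on e: the minimum comparison marks every tying row, while row_number marks exactly one of them.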

create unique id for combination of a pair of values from two columns in a spark dataframe

I have a spark dataframe of six columns say (col1, col2,...col6). I want to create a unique id for each combination of values from "col1" and "col2" and add it to the dataframe. Can someone help me with some pyspark code on how to do it?
You can achieve it using monotonically_increasing_id (PySpark > 1.6) or monotonicallyIncreasingId (PySpark < 1.6). The generated IDs are unique and increasing, but not consecutive, as the output below shows:
>>> from pyspark.sql.functions import monotonically_increasing_id
>>> rdd=sc.parallelize([[12,23,3,4,5,6],[12,23,56,67,89,20],[12,23,0,0,0,0],[12,2,12,12,12,23],[1,2,3,4,56,7],[1,2,3,4,56,7]])
>>> df = rdd.toDF(['col_1','col_2','col_3','col_4','col_5','col_6'])
>>> df.show()
+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|col_6|
+-----+-----+-----+-----+-----+-----+
| 12| 23| 3| 4| 5| 6|
| 12| 23| 56| 67| 89| 20|
| 12| 23| 0| 0| 0| 0|
| 12| 2| 12| 12| 12| 23|
| 1| 2| 3| 4| 56| 7|
| 1| 2| 3| 4| 56| 7|
+-----+-----+-----+-----+-----+-----+
>>> df_1=df.groupBy(df.col_1,df.col_2).count().withColumn("id", monotonically_increasing_id()).select(['col_1','col_2','id'])
>>> df_1.show()
+-----+-----+-------------+
|col_1|col_2| id|
+-----+-----+-------------+
| 12| 23| 34359738368|
| 1| 2|1434519076864|
| 12| 2|1554778161152|
+-----+-----+-------------+
>>> df.join(df_1,(df.col_1==df_1.col_1) & (df.col_2==df_1.col_2)).drop(df_1.col_1).drop(df_1.col_2).show()
+-----+-----+-----+-----+-----+-----+-------------+
|col_3|col_4|col_5|col_6|col_1|col_2| id|
+-----+-----+-----+-----+-----+-----+-------------+
| 3| 4| 5| 6| 12| 23| 34359738368|
| 56| 67| 89| 20| 12| 23| 34359738368|
| 0| 0| 0| 0| 12| 23| 34359738368|
| 3| 4| 56| 7| 1| 2|1434519076864|
| 3| 4| 56| 7| 1| 2|1434519076864|
| 12| 12| 12| 23| 12| 2|1554778161152|
+-----+-----+-----+-----+-----+-----+-------------+
If you really need to generate the unique ID from col1 and col2 you can also create a hash value leveraging the sha2 function of Spark.
First let's generate some dummy data with:
from random import randint
max_range = 10
df1 = spark.createDataFrame(
[(x, x * randint(1, max_range), x * 10 * randint(1, max_range)) for x in range(1, max_range)],
['C1', 'C2', 'C3'])
>>> df1.show()
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 1| 1| 60|
| 2| 14|180|
| 3| 21|270|
| 4| 16|360|
| 5| 35|250|
| 6| 30|480|
| 7| 28|210|
| 8| 80|320|
| 9| 45|360|
+---+---+---+
Then create a new uid column from columns C2 and C3 with the following code:
from pyspark.sql.functions import col, sha2, concat
df1.withColumn("uid", sha2(concat(col("C2"), col("C3")), 256)).show(10, False)
And the output:
+---+---+---+--------------------+
| C1| C2| C3| uid|
+---+---+---+--------------------+
| 1| 1| 60|a512db2741cd20693...|
| 2| 14|180|2f6543dc6c0e06e4a...|
| 3| 21|270|bd3c65ddde4c6f733...|
| 4| 16|360|c7a1e8c59fc9dcc21...|
| 5| 35|250|cba1aeb7a72d9ae27...|
| 6| 30|480|ad7352ff8927cf790...|
| 7| 28|210|ea7bc25aa7cd3503f...|
| 8| 80|320|02e1d953517339552...|
| 9| 45|360|b485cf8f710a65755...|
+---+---+---+--------------------+
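One caveat with hashing a plain concatenation: different value pairs can produce the same concatenated string, and therefore the same hash. The sketch below shows this with Python's hashlib, which computes the same SHA-256 digest over the same bytes; the helper name uid and its sep parameter are hypothetical and exist only for this illustration.

```python
import hashlib

def uid(c2, c3, sep=""):
    """SHA-256 hex digest of the two values joined as strings (hypothetical helper)."""
    return hashlib.sha256(f"{c2}{sep}{c3}".encode("utf-8")).hexdigest()

# Without a separator, (1, 23) and (12, 3) both become the string "123".
plain_a, plain_b = uid(1, 23), uid(12, 3)

# With a separator the inputs are "1|23" and "12|3", so the digests differ.
sep_a, sep_b = uid(1, 23, "|"), uid(12, 3, "|")

print(plain_a == plain_b)  # True: the two pairs collide
print(sep_a == sep_b)      # False: the separator keeps them apart
```

In Spark, one way to get the same effect would be to use concat_ws with a separator that cannot occur in the data in place of concat.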

update pyspark data frame column based on another column

Below is a DataFrame in PySpark. I want to update the column val based on the values in the tests column.
df.show()
+---------+----+---+
| tests| val|asd|
+---------+----+---+
| test1| Y| 1|
| test2| N| 2|
| test2| Y| 1|
| test1| N| 2|
| test1| N| 3|
| test3| N| 4|
| test4| Y| 5|
+---------+----+---+
If any row for a given test has val Y, then val should be set to Y for all rows of that test. Otherwise the rows should keep whatever values they have.
Basically, I want the DataFrame to look like this:
result_df.show()
+---------+----+---+
| tests| val|asd|
+---------+----+---+
| test1| Y| 1|
| test2| Y| 2|
| test2| Y| 1|
| test1| Y| 2|
| test1| Y| 3|
| test3| N| 4|
| test4| Y| 5|
+---------+----+---+
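What should I do to achieve that?
Use the max window function with selectExpr. This works because 'Y' sorts after 'N', so max(val) within a group is 'Y' whenever any row in the group has 'Y':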
df.selectExpr(
'tests', 'max(val) over (partition by tests) as val', 'asd'
).show()
+-----+---+---+
|tests|val|asd|
+-----+---+---+
|test4| Y| 5|
|test3| N| 4|
|test1| Y| 1|
|test1| Y| 2|
|test1| Y| 3|
|test2| Y| 2|
|test2| Y| 1|
+-----+---+---+
Here is a solution.
First we find out for each test whether it has val Y.
import pyspark.sql.functions as sf
by_test = df.groupBy('tests').agg(sf.sum((sf.col('val') == 'Y').cast('int')).alias('HasY'))
by_test.show()
+-----+----+
|tests|HasY|
+-----+----+
|test4| 1|
|test3| 0|
|test1| 1|
|test2| 1|
+-----+----+
Join back to the original DataFrame:
df = df.join(by_test, on='tests')
df.show()
+-----+---+---+----+
|tests|val|asd|HasY|
+-----+---+---+----+
|test4| Y| 5| 1|
|test3| N| 4| 0|
|test1| Y| 1| 1|
|test1| N| 2| 1|
|test1| N| 3| 1|
|test2| N| 2| 1|
|test2| Y| 1| 1|
+-----+---+---+----+
Create a new column with the same name using when/otherwise
df = df.withColumn('val', sf.when(sf.col('HasY') > 0, 'Y').otherwise(sf.col('val')))
df = df.drop('HasY')
df.show()
+-----+---+---+
|tests|val|asd|
+-----+---+---+
|test4| Y| 5|
|test3| N| 4|
|test1| Y| 1|
|test1| Y| 2|
|test1| Y| 3|
|test2| Y| 2|
|test2| Y| 1|
+-----+---+---+
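The reason the max-based answer works can be seen in plain Python, where strings compare the same way: 'Y' is greater than 'N'. The sketch below mimics "max(val) over (partition by tests)" with a dictionary; the variable names best and result are made up for the example.

```python
rows = [("test1", "Y", 1), ("test2", "N", 2), ("test2", "Y", 1), ("test1", "N", 2),
        ("test1", "N", 3), ("test3", "N", 4), ("test4", "Y", 5)]

# Largest val seen per test; 'Y' > 'N', so one 'Y' is enough to win.
best = {}
for tests, val, _ in rows:
    best[tests] = max(best.get(tests, val), val)

# Attach the per-test maximum to every row, keeping the original row order.
result = [(tests, best[tests], asd) for tests, _, asd in rows]
for row in result:
    print(row)
```

This shortcut relies on val holding only 'Y' and 'N'. With other values, the explicit HasY count from the second answer is the safer route.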

Joining two data frames and result data frames contain non duplicate items in PySpark?

I have created two DataFrames by executing the commands below. I want to
join the two DataFrames so that the result contains no duplicate items, in PySpark.
df1 = sc.parallelize([
("a",1,1),
("b",2,2),
("d",4,2),
("e",4,1),
("c",3,4)]).toDF(['SID','SSection','SRank'])
df1.show()
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
| a| 1| 1|
| b| 2| 2|
| d| 4| 2|
| e| 4| 1|
| c| 3| 4|
+---+--------+-----+
df2 is
df2=sc.parallelize([
("a",2,1),
("b",2,3),
("f",4,2),
("e",4,1),
("c",3,4)]).toDF(['SID','SSection','SRank'])
+---+--------+-----+
|SID|SSection|SRank|
+---+--------+-----+
|  a|       2|    1|
|  b|       2|    3|
|  f|       4|    2|
|  e|       4|    1|
|  c|       3|    4|
+---+--------+-----+
I want to join above two tables like below.
+---+--------+----------+----------+
|SID|SSection|test1SRank|test2SRank|
+---+--------+----------+----------+
| f| 4| 0| 2|
| e| 4| 1| 1|
| d| 4| 2| 0|
| c| 3| 4| 4|
| b| 2| 2| 3|
| a| 1| 1| 0|
| a| 2| 0| 1|
+---+--------+----------+----------+
Doesn't look like something that can be achieved with a single join. Here's a solution involving multiple joins:
from pyspark.sql.functions import col
d1 = df1.unionAll(df2).select("SID" , "SSection" ).distinct()
t1 = d1.join(df1 , ["SID", "SSection"] , "leftOuter").select(d1.SID , d1.SSection , col("SRank").alias("test1Srank"))
t2 = d1.join(df2 , ["SID", "SSection"] , "leftOuter").select(d1.SID , d1.SSection , col("SRank").alias("test2Srank"))
t1.join(t2, ["SID", "SSection"]).na.fill(0).show()
+---+--------+----------+----------+
|SID|SSection|test1Srank|test2Srank|
+---+--------+----------+----------+
| b| 2| 2| 3|
| c| 3| 4| 4|
| d| 4| 2| 0|
| e| 4| 1| 1|
| f| 4| 0| 2|
| a| 1| 1| 0|
| a| 2| 0| 1|
+---+--------+----------+----------+
You can simply rename the SRank column in each DataFrame, use an outer join on SID and SSection, and fill the missing ranks with the na.fill function:
df1.withColumnRenamed("SRank", "test1SRank").join(df2.withColumnRenamed("SRank", "test2SRank"), ["SID", "SSection"], "outer").na.fill(0)
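To check what the outer join plus na.fill(0) should produce, here is a small plain-Python sketch of the same logic on the question's data. It assumes each (SID, SSection) pair appears at most once per input, which holds for this data; the names rank1, rank2 and joined are hypothetical.

```python
df1 = [("a", 1, 1), ("b", 2, 2), ("d", 4, 2), ("e", 4, 1), ("c", 3, 4)]
df2 = [("a", 2, 1), ("b", 2, 3), ("f", 4, 2), ("e", 4, 1), ("c", 3, 4)]

# Index each input by the join key (SID, SSection).
rank1 = {(sid, sec): rank for sid, sec, rank in df1}
rank2 = {(sid, sec): rank for sid, sec, rank in df2}

# Outer join: take every key from either side, defaulting a missing rank to 0.
joined = sorted(
    (sid, sec, rank1.get((sid, sec), 0), rank2.get((sid, sec), 0))
    for sid, sec in rank1.keys() | rank2.keys()
)
for row in joined:
    print(row)  # (SID, SSection, test1SRank, test2SRank)
```

The seven rows match the expected table in the question, apart from row order.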

Resources