Using spark dataframe transpose - apache-spark

I am trying to figure out how to solve this use case using a Spark DataFrame.
In the Google Sheet below, I have the source data where people's answers to the survey questions are stored. The question columns will number roughly 1,000 or more, and the set of columns is dynamic rather than fixed.
There is also a metadata table that describes each question, its description, and the choices it can contain.
The output table should look like the one shown in the sheet. Any suggestions or ideas on how this can be achieved?
https://docs.google.com/spreadsheets/d/1BAY8XWaio1DbzcQeQgru6PuNfT9A7Uhf650x_-PAjqo/edit#gid=0

Let's assume your main table is called df:
+---------+-----------+-----------+------+------+------+
|survey_id|response_id|person_name|Q1D102|Q1D103|Q1D105|
+---------+-----------+-----------+------+------+------+
|xyz      |xyz        |john       |1     |2     |1     |
|abc      |abc        |foo        |3     |1     |1     |
|def      |def        |bar        |2     |2     |2     |
+---------+-----------+-----------+------+------+------+
and the mapping table is called df2:
+-----------+-------------+-------------------+---------+-----------+
|question_id|question_name|question_text      |choice_id|choice_desc|
+-----------+-------------+-------------------+---------+-----------+
|Q1D102     |Gender       |What is your gender|1        |Male       |
|Q1D102     |Gender       |What is your gender|2        |Female     |
|Q1D102     |Gender       |What is your gender|3        |Diverse    |
|Q1D103     |Age          |What is your age   |1        |20 - 50    |
|Q1D103     |Age          |What is your age   |2        |50 >       |
|Q1D105     |work_status  |Do you work        |1        |Yes        |
|Q1D105     |work_status  |Do you work        |2        |No         |
+-----------+-------------+-------------------+---------+-----------+
We can construct a dynamic unpivot expression as below:
val columns = df.columns.filter(c => c.startsWith("Q1D"))
val data = columns.map(c => s"'$c', $c").mkString(",")
val finalExpr = s"stack(${columns.length}, $data) as (question_id, choice_id)"
With 3 questions (Q1D102, Q1D103 and Q1D105), we get the following expression: stack(3, 'Q1D102', Q1D102, 'Q1D103', Q1D103, 'Q1D105', Q1D105) as (question_id, choice_id)
Finally, we use the constructed variable:
val result = df
  .selectExpr("survey_id", "response_id", "person_name", finalExpr)
  .join(df2, Seq("question_id", "choice_id"), "left")
You get this result:
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|question_id|choice_id|survey_id|response_id|person_name|question_name|question_text      |choice_desc|
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|Q1D102     |1        |xyz      |xyz        |john       |Gender       |What is your gender|Male       |
|Q1D102     |2        |def      |def        |bar        |Gender       |What is your gender|Female     |
|Q1D102     |3        |abc      |abc        |foo        |Gender       |What is your gender|Diverse    |
|Q1D103     |1        |abc      |abc        |foo        |Age          |What is your age   |20 - 50    |
|Q1D103     |2        |xyz      |xyz        |john       |Age          |What is your age   |50 >       |
|Q1D103     |2        |def      |def        |bar        |Age          |What is your age   |50 >       |
|Q1D105     |1        |xyz      |xyz        |john       |work_status  |Do you work        |Yes        |
|Q1D105     |1        |abc      |abc        |foo        |work_status  |Do you work        |Yes        |
|Q1D105     |2        |def      |def        |bar        |work_status  |Do you work        |No         |
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
Which I think is what you need (just unordered), good luck!
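For completeness, here is a rough PySpark equivalent of the same approach (a sketch only, assuming the same df and df2 DataFrames as above; on Spark 3.4+ the built-in DataFrame.unpivot could replace the hand-built stack expression):

# Build the same dynamic stack() expression for every question column (assumed to start with "Q1D")
question_cols = [c for c in df.columns if c.startswith("Q1D")]
stack_args = ", ".join(f"'{c}', {c}" for c in question_cols)
unpivot_expr = f"stack({len(question_cols)}, {stack_args}) as (question_id, choice_id)"

result = (
    df.selectExpr("survey_id", "response_id", "person_name", unpivot_expr)
      .join(df2, ["question_id", "choice_id"], "left")
)
result.show(truncate=False)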

Related

How to pivot or transform data in ArrayType format in pyspark?

I have data in the following format:
|cust_id |card_num |balance|payment |due |card_type|
|:-------|:--------|:------|:-------|:----|:------- |
|c1 |1234 |567 |344 |33 |A |
|c1 |2345 |57 |44 |3 |B |
|c2 |123 |561 |34 |39 |A |
|c3 |345 |517 |914 |23 |C |
|c3 |127 |56 |34 |32 |B |
|c3 |347 |67 |344 |332 |B |
I want it to be converted into the following ArrayType format:
|cust_id|card_num |balance |payment |due | card_type|
|:------|:-------- |:------ |:------- |:---- |:---- |
|c1 |[1234,2345] |[567,57] |[344,44] |[33,3] |[A,B] |
|c2 |[123] |[561] |[34] |[39] |[A] |
|c3 |[345,127,347]|[517,56,67]|[914,34,344]|[23,32,332]|[C,B,B] |
How can I write generic code in PySpark to do this transformation and save the result in CSV format?
You just need to group by the cust_id column and use the collect_list function to get array-type aggregated columns.
from pyspark.sql.functions import collect_list

df = ...  # the input DataFrame shown above

df.groupBy("cust_id").agg(
    collect_list("card_num").alias("card_num"),
    collect_list("balance").alias("balance"),
    collect_list("payment").alias("payment"),
    collect_list("due").alias("due"),
    collect_list("card_type").alias("card_type"))
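The question also asks for generic code and a CSV export. Below is a hedged sketch of one way to do both (the "|" separator and the output path are made-up placeholders, not from the question): build the collect_list aggregations from df.columns so the code works for any set of columns, then flatten the arrays into delimited strings, since the CSV writer cannot store array columns.

from pyspark.sql import functions as F

key = "cust_id"
# One collect_list per non-key column, whatever the columns happen to be
agg_exprs = [F.collect_list(c).alias(c) for c in df.columns if c != key]
result = df.groupBy(key).agg(*agg_exprs)

# CSV cannot hold array columns, so join each array into a "|"-separated string first
csv_ready = result.select(
    key,
    *[F.concat_ws("|", F.col(c).cast("array<string>")).alias(c)
      for c in result.columns if c != key],
)
csv_ready.write.mode("overwrite").option("header", "true").csv("/tmp/cards_by_cust")  # placeholder path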

Look back based on X days and get column values based on a condition in Spark

I have the following DF:
--------------------------------
|Id |Date |Value |cond |
|-------------------------------|
|1 |2022-08-03 | 100| 1 |
|1 |2022-08-04 | 200| 2 |
|1 |2022-08-05 | 150| 3 |
|1 |2022-08-06 | 300| 4 |
|1 |2022-08-07 | 400| 5 |
|1 |2022-08-08 | 150| 6 |
|1 |2022-08-09 | 500| 7 |
|1 |2022-08-10 | 150| 8 |
|1 |2022-08-11 | 150| 9 |
|1 |2022-08-12 | 700| 1 |
|1 |2022-08-13 | 800| 2 |
|1 |2022-08-14 | 150| 2 |
|1 |2022-08-15 | 300| 0 |
|1 |2022-08-16 | 200| 1 |
|1 |2022-08-17 | 150| 3 |
|1 |2022-08-18 | 150| 1 |
|1 |2022-08-19 | 250| 4 |
|1 |2022-08-20 | 150| 5 |
|1 |2022-08-21 | 400| 6 |
|2 |2022-08-03 | 100| 1 |
|2 |2022-08-04 | 200| 2 |
|2 |2022-08-05 | 150| 1 |
|2 |2022-08-06 | 300| 1 |
|2 |2022-08-07 | 400| 1 |
|2 |2022-08-08 | 150| 1 |
|2 |2022-08-09 | 125| 1 |
|2 |2022-08-10 | 150| 1 |
|2 |2022-08-11 | 150| 3 |
|2 |2022-08-12 | 170| 6 |
|2 |2022-08-13 | 150| 7 |
|2 |2022-08-14 | 150| 8 |
|2 |2022-08-15 | 300| 1 |
|2 |2022-08-16 | 150| 9 |
|2 |2022-08-17 | 150| 0 |
|2 |2022-08-18 | 400| 1 |
|2 |2022-08-19 | 150| 1 |
|2 |2022-08-20 | 500| 1 |
|2 |2022-08-21 | 150| 1 |
--------------------------------
And this one:
---------------------
|Date | cond |
|-------------------|
|2022-08-03 | 1 |
|2022-08-04 | 2 |
|2022-08-05 | 1 |
|2022-08-06 | 1 |
|2022-08-07 | 1 |
|2022-08-08 | 1 |
|2022-08-09 | 1 |
|2022-08-10 | 1 |
|2022-08-11 | 3 |
|2022-08-12 | 6 |
|2022-08-13 | 8 |
|2022-08-14 | 9 |
|2022-08-15 | 1 |
|2022-08-16 | 2 |
|2022-08-17 | 2 |
|2022-08-18 | 0 |
|2022-08-19 | 1 |
|2022-08-20 | 3 |
|2022-08-21 | 1 |
--------------------
My expected output is:
-------------------------------
|Id |Date |Avg |Count|
|-----------------------------|
|1 |2022-08-03 | 0| 0 |
|1 |2022-08-04 | 0| 0 |
|1 |2022-08-05 | 0| 0 |
|1 |2022-08-06 | 0| 0 |
|1 |2022-08-07 | 0| 0 |
|1 |2022-08-08 | 0| 0 |
|1 |2022-08-09 | 0| 0 |
|1 |2022-08-10 | 0| 0 |
|1 |2022-08-11 | 0| 0 |
|1 |2022-08-12 | 0| 0 |
|1 |2022-08-13 | 0| 0 |
|1 |2022-08-14 | 0| 0 |
|1 |2022-08-15 | 0| 0 |
|1 |2022-08-16 | 0| 0 |
|1 |2022-08-17 | 0| 0 |
|1 |2022-08-18 | 0| 0 |
|1 |2022-08-19 | 0| 0 |
|1 |2022-08-20 | 0| 0 |
|1 |2022-08-21 | 0| 0 |
|2 |2022-08-03 | 0| 0 |
|2 |2022-08-04 | 0| 0 |
|2 |2022-08-05 | 0| 1 |
|2 |2022-08-06 | 0| 2 |
|2 |2022-08-07 | 0| 3 |
|2 |2022-08-08 | 237,5| 4 |
|2 |2022-08-09 | 250| 4 |
|2 |2022-08-10 |243,75| 4 |
|2 |2022-08-11 | 0| 0 |
|2 |2022-08-12 | 0| 0 |
|2 |2022-08-13 | 0| 0 |
|2 |2022-08-14 | 0| 0 |
|2 |2022-08-15 |206,25| 4 |
|2 |2022-08-16 | 0| 0 |
|2 |2022-08-17 | 0| 0 |
|2 |2022-08-18 | 0| 0 |
|2 |2022-08-19 |243,75| 4 |
|2 |2022-08-20 | 0| 0 |
|2 |2022-08-21 | 337,5| 4 |
-------------------------------
The algorithm is:
Verify if Date and Cond are the same in the first and second DFs.
If the condition is true, I need to look back four days in DF1 (D-1, D-2, D-3, D-4) based on Cond and calculate the average (Avg) and count of those values. If I have more than 4 days, I need to use the top 4 values to calculate the Avg, and Count will always be 4 in that case.
Example situations based on the inputs:
Id = 1, Date = 2022-08-08
Count is 0 because the condition is false, so Avg is 0 too.
Id = 2, Date = 2022-08-08
Count is 4 because the condition is true, so I take the values from 2022-08-07, 2022-08-06, 2022-08-05 and 2022-08-03. I exclude 2022-08-04 because its Cond value is 2, while the reference date's Cond is 1.
Id = 2, Date = 2022-08-07
Count is 3 because the condition is true, but there are only 3 matching values before that date. I can't calculate the Avg since I need four values, so in that case Avg is zero.
I tried to use window functions, but with no success. I was able to produce the output DF using SQL (joins with OUTER APPLY), but Spark doesn't have OUTER APPLY. So my questions are:
How can I generate the output DF?
What is the best way to generate the output DF?
MVCE to generate the input DFs in pyspark:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data_1 = [
("1","2022-08-03",100,1),
("1","2022-08-04",200,2),
("1","2022-08-05",150,3),
("1","2022-08-06",300,4),
("1","2022-08-07",400,5),
("1","2022-08-08",150,6),
("1","2022-08-09",500,7),
("1","2022-08-10",150,8),
("1","2022-08-11",150,9),
("1","2022-08-12",700,1),
("1","2022-08-13",800,2),
("1","2022-08-14",150,2),
("1","2022-08-15",300,0),
("1","2022-08-16",200,1),
("1","2022-08-17",150,3),
("1","2022-08-18",150,1),
("1","2022-08-19",250,4),
("1","2022-08-20",150,5),
("1","2022-08-21",400,6),
("2","2022-08-03",100,1),
("2","2022-08-04",200,2),
("2","2022-08-05",150,1),
("2","2022-08-06",300,1),
("2","2022-08-07",400,1),
("2","2022-08-08",150,1),
("2","2022-08-09",125,1),
("2","2022-08-10",150,1),
("2","2022-08-11",150,3),
("2","2022-08-12",170,6),
("2","2022-08-13",150,7),
("2","2022-08-14",150,8),
("2","2022-08-15",300,1),
("2","2022-08-16",150,9),
("2","2022-08-17",150,0),
("2","2022-08-18",400,1),
("2","2022-08-19",150,1),
("2","2022-08-20",500,1),
("2","2022-08-21",150,1)
]
schema_1 = StructType([
    StructField("Id", StringType(), True),
    StructField("Date", StringType(), True),   # kept as a string; the answers below cast it to date/timestamp
    StructField("Value", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])
df_1 = spark.createDataFrame(data=data_1,schema=schema_1)
data_2 = [
("2022-08-03", 1),
("2022-08-04", 2),
("2022-08-05", 1),
("2022-08-06", 1),
("2022-08-07", 1),
("2022-08-08", 1),
("2022-08-09", 1),
("2022-08-10", 1),
("2022-08-11", 3),
("2022-08-12", 6),
("2022-08-13", 8),
("2022-08-14", 9),
("2022-08-15", 1),
("2022-08-16", 2),
("2022-08-17", 2),
("2022-08-18", 0),
("2022-08-19", 1),
("2022-08-20", 3),
("2022-08-21", 1)
]
schema_2 = StructType([
    StructField("Date", StringType(), True),   # kept as a string; the answers below cast it to date/timestamp
    StructField("Cond", IntegerType(), True)
])
df_2 = spark.createDataFrame(data=data_2,schema=schema_2)
UPDATE: I updated the question to be clearer about the conditions for joining the DFs!
Do a left join to get the dates you are interested in.
Then use a pyspark.sql.Window to collect the values you need into a list, and take the size of that list as Count.
Finally, with the help of pyspark.sql.functions.aggregate, compute the Avg.
from pyspark.sql import functions as F, Window

# cast to date, and rename columns for later use
df_1 = df_1.withColumn("Date", F.col("Date").cast("date"))
df_2 = df_2.withColumn("Date", F.col("Date").cast("date"))
df_2 = df_2.withColumnRenamed("Date", "DateDf2") \
           .withColumnRenamed("Cond", "CondDf2")

# left join
df = df_1.join(df_2, (df_1.Cond == df_2.CondDf2) & (df_1.Date == df_2.DateDf2), how="left")

windowSpec = Window.partitionBy("Id", "Cond").orderBy("Date")

# all the magic happens here!
df = (
    # only start counting when "DateDf2" is not null, and put the values into a list
    df.withColumn(
        "value_list",
        F.when(F.isnull("DateDf2"), F.array())
         .otherwise(F.collect_list("Value").over(windowSpec.rowsBetween(-4, -1))),
    )
    .withColumn("Count", F.size("value_list"))
    # use aggregate to sum up the list only if its size is 4, and divide by 4 to get the average
    .withColumn(
        "Avg",
        F.when(F.col("Count") == 4,
               F.aggregate("value_list", F.lit(0), lambda acc, x: acc + x) / 4)
         .otherwise(F.lit(0)),
    )
    .select("Id", "Date", "Avg", "Count")
    .orderBy("Id", "Date")
)
Output is:
+---+----------+------+-----+
|Id |Date |Avg |Count|
+---+----------+------+-----+
|1 |2022-08-03|0.0 |0 |
|1 |2022-08-04|0.0 |0 |
|1 |2022-08-05|0.0 |0 |
|1 |2022-08-06|0.0 |0 |
|1 |2022-08-07|0.0 |0 |
|1 |2022-08-08|0.0 |0 |
|1 |2022-08-09|0.0 |0 |
|1 |2022-08-10|0.0 |0 |
|1 |2022-08-11|0.0 |0 |
|1 |2022-08-12|0.0 |0 |
|1 |2022-08-13|0.0 |0 |
|1 |2022-08-14|0.0 |0 |
|1 |2022-08-15|0.0 |0 |
|1 |2022-08-16|0.0 |0 |
|1 |2022-08-17|0.0 |0 |
|1 |2022-08-18|0.0 |0 |
|1 |2022-08-19|0.0 |0 |
|1 |2022-08-20|0.0 |0 |
|1 |2022-08-21|0.0 |0 |
|2 |2022-08-03|0.0 |0 |
|2 |2022-08-04|0.0 |0 |
|2 |2022-08-05|0.0 |1 |
|2 |2022-08-06|0.0 |2 |
|2 |2022-08-07|0.0 |3 |
|2 |2022-08-08|237.5 |4 |
|2 |2022-08-09|250.0 |4 |
|2 |2022-08-10|243.75|4 |
|2 |2022-08-11|0.0 |0 |
|2 |2022-08-12|0.0 |0 |
|2 |2022-08-13|0.0 |0 |
|2 |2022-08-14|0.0 |0 |
|2 |2022-08-15|206.25|4 |
|2 |2022-08-16|0.0 |0 |
|2 |2022-08-17|0.0 |0 |
|2 |2022-08-18|0.0 |0 |
|2 |2022-08-19|243.75|4 |
|2 |2022-08-20|0.0 |0 |
|2 |2022-08-21|337.5 |4 |
+---+----------+------+-----+
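One caveat worth flagging: F.aggregate was only added in PySpark 3.1. On Spark 2.4-3.0 the same fold can be expressed through the SQL higher-order function via expr, for example (a sketch against the same value_list column built above):

from pyspark.sql import functions as F

# Same "sum the list, then divide by 4" logic via the SQL aggregate() function
avg_col = F.when(
    F.col("Count") == 4,
    F.expr("aggregate(value_list, 0, (acc, x) -> acc + x)") / 4
).otherwise(F.lit(0))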
Here is another solution to the same problem:
from pyspark.sql import Window
import pyspark.sql.functions as F

df_1 = df_1.withColumn("Date", F.col("Date").cast("timestamp"))
df_2 = df_2.withColumn("Date", F.col("Date").cast("timestamp"))

window_spec = Window.partitionBy(["Id"]).orderBy("Date")
four_days_sld_wnd_excl_current_row = Window.partitionBy(["Id"]).orderBy(["rnk"]).rangeBetween(-4, -1)
window_spec_count_cond_ = (Window.partitionBy(["Id"])
                           .orderBy(F.unix_timestamp("Date", 'yyyy-MM-dd') / 86400)
                           .rangeBetween(-4, -1))

agg_col_cond_ = (F.col("agg") == 0.0)
date_2_col_cond_ = (F.col("Date_2").isNull())

valid_4_days_agg_value = (
    F.when((~date_2_col_cond_) & (F.size(F.col("date_arrays_with_cond_1")) == 4),
           F.sum(F.col("Value")).over(four_days_sld_wnd_excl_current_row))
     .otherwise(F.lit(0.0))
)
count_cond_ = (
    F.when(~agg_col_cond_ & ~date_2_col_cond_, F.lit(4))
     .when(agg_col_cond_ & date_2_col_cond_, F.lit(0))
     .otherwise(F.size(F.collect_set(F.col("Date_2")).over(window_spec_count_cond_)))
)

df_jn = df_1.join(df_2, ["Date", "Cond"], "left") \
    .select(df_1["*"], df_2["Date"].alias("Date_2")).orderBy("Id", df_1["Date"])

filter_having_cond_1 = (F.col("Cond") == 1)
cond_columns_matching = (F.col("Date_2").isNull())

df_fnl_with_cond_val_1 = df_jn.filter(filter_having_cond_1)
df_fnl_with_cond_val_other = df_jn.filter(~filter_having_cond_1) \
    .withColumn("agg", F.lit(0.0)) \
    .withColumn("count", F.lit(0)) \
    .drop("Date_2")

df_fnl_with_cond_val_1 = df_fnl_with_cond_val_1 \
    .withColumn("rnk", F.row_number().over(window_spec)) \
    .withColumn("date_arrays_with_cond_1", F.collect_set(F.col("Date")).over(four_days_sld_wnd_excl_current_row)) \
    .withColumn("agg", valid_4_days_agg_value / 4) \
    .withColumn("count", count_cond_) \
    .drop("date_arrays_with_cond_1", "rnk", "Date_2")

df_fnl = df_fnl_with_cond_val_1.unionByName(df_fnl_with_cond_val_other)
df_fnl.orderBy(["Id", "Date"]).show(50, 0)
Output:
+---+-------------------+-----+----+------+-----+
|Id |Date |Value|Cond|agg |count|
+---+-------------------+-----+----+------+-----+
|1 |2022-08-03 00:00:00|100 |1 |0.0 |0 |
|1 |2022-08-04 00:00:00|200 |2 |0.0 |0 |
|1 |2022-08-05 00:00:00|150 |3 |0.0 |0 |
|1 |2022-08-06 00:00:00|300 |4 |0.0 |0 |
|1 |2022-08-07 00:00:00|400 |5 |0.0 |0 |
|1 |2022-08-08 00:00:00|150 |6 |0.0 |0 |
|1 |2022-08-09 00:00:00|500 |7 |0.0 |0 |
|1 |2022-08-10 00:00:00|150 |8 |0.0 |0 |
|1 |2022-08-11 00:00:00|150 |9 |0.0 |0 |
|1 |2022-08-12 00:00:00|700 |1 |0.0 |0 |
|1 |2022-08-13 00:00:00|800 |2 |0.0 |0 |
|1 |2022-08-14 00:00:00|150 |2 |0.0 |0 |
|1 |2022-08-15 00:00:00|300 |0 |0.0 |0 |
|1 |2022-08-16 00:00:00|200 |1 |0.0 |0 |
|1 |2022-08-17 00:00:00|150 |3 |0.0 |0 |
|1 |2022-08-18 00:00:00|150 |1 |0.0 |0 |
|1 |2022-08-19 00:00:00|250 |4 |0.0 |0 |
|1 |2022-08-20 00:00:00|150 |5 |0.0 |0 |
|1 |2022-08-21 00:00:00|400 |6 |0.0 |0 |
|2 |2022-08-03 00:00:00|100 |1 |0.0 |0 |
|2 |2022-08-04 00:00:00|200 |2 |0.0 |0 |
|2 |2022-08-05 00:00:00|150 |1 |0.0 |1 |
|2 |2022-08-06 00:00:00|300 |1 |0.0 |2 |
|2 |2022-08-07 00:00:00|400 |1 |0.0 |3 |
|2 |2022-08-08 00:00:00|150 |1 |237.5 |4 |
|2 |2022-08-09 00:00:00|125 |1 |250.0 |4 |
|2 |2022-08-10 00:00:00|150 |1 |243.75|4 |
|2 |2022-08-11 00:00:00|150 |3 |0.0 |0 |
|2 |2022-08-12 00:00:00|170 |6 |0.0 |0 |
|2 |2022-08-13 00:00:00|150 |7 |0.0 |0 |
|2 |2022-08-14 00:00:00|150 |8 |0.0 |0 |
|2 |2022-08-15 00:00:00|300 |1 |206.25|4 |
|2 |2022-08-16 00:00:00|150 |9 |0.0 |0 |
|2 |2022-08-17 00:00:00|150 |0 |0.0 |0 |
|2 |2022-08-18 00:00:00|400 |1 |0.0 |0 |
|2 |2022-08-19 00:00:00|150 |1 |243.75|4 |
|2 |2022-08-20 00:00:00|500 |1 |0.0 |0 |
|2 |2022-08-21 00:00:00|150 |1 |337.5 |4 |
+---+-------------------+-----+----+------+-----+

PySpark: How to calculate days between when last condition was met (positive vs negative)

Current DF (filtered by a single userId; flag is 1 when the loss is > 0, -1 when it is <= 0):
display(df):
+------+----------+---------+----+
| user|Date |RealLoss |flag|
+------+----------+---------+----+
|100364|2019-02-01| -16.5| 1|
|100364|2019-02-02| 73.5| -1|
|100364|2019-02-03| 31| -1|
|100364|2019-02-09| -5.2| 1|
|100364|2019-02-10| -34.5| 1|
|100364|2019-02-13| -8.1| 1|
|100364|2019-02-18| 5.68| -1|
|100364|2019-02-19| 5.76| -1|
|100364|2019-02-20| 9.12| -1|
|100364|2019-02-26| 9.4| -1|
|100364|2019-02-27| -30.6| 1|
+------+----------+---------+----+
The desired outcome DF should show the number of days since the last win ('RecencyLastWin') and since the last loss ('RecencyLastLoss'):
display(df):
+------+----------+---------+----+--------------+---------------+
| user|Date |RealLoss |flag|RecencyLastWin|RecencyLastLoss|
+------+----------+---------+----+--------------+---------------+
|100364|2019-02-01| -16.5| 1| null| null|
|100364|2019-02-02| 73.5| -1| 1| null|
|100364|2019-02-03| 31| -1| 2| 1|
|100364|2019-02-09| -5.2| 1| 8| 6|
|100364|2019-02-10| -34.5| 1| 1| 7|
|100364|2019-02-13| -8.1| 1| 1| 10|
|100364|2019-02-18| 5.68| -1| 5| 15|
|100364|2019-02-19| 5.76| -1| 6| 1|
|100364|2019-02-20| 9.12| -1| 7| 1|
|100364|2019-02-26| 9.4| -1| 13| 6|
|100364|2019-02-27| -30.6| 1| 14| 1|
+------+----------+---------+----+--------------+---------------+
My approach was the following:
from pyspark.sql.window import Window
w = Window.partitionBy("userId", 'PlayerSiteCode').orderBy("EventDate")
last_positive = check.filter('flag = "1"').withColumn('last_positive_day' , F.lag('EventDate').over(w))
last_negative = check.filter('flag = "-1"').withColumn('last_negative_day' , F.lag('EventDate').over(w))
finalcheck = check.join(last_positive.select('userId', 'PlayerSiteCode', 'EventDate', 'last_positive_day'), ['userId', 'PlayerSiteCode', 'EventDate'], how = 'left')\
.join(last_negative.select('userId', 'PlayerSiteCode', 'EventDate', 'last_negative_day'), ['userId', 'PlayerSiteCode', 'EventDate'], how = 'left')\
.withColumn('previous_date_played' , F.lag('EventDate').over(w))\
.withColumn('last_positive_day_count', F.datediff(F.col('EventDate'), F.col('last_positive_day')))\
.withColumn('last_negative_day_count', F.datediff(F.col('EventDate'), F.col('last_negative_day')))
Then I tried to add the following (multiple attempts), but failed to 'perfectly' return what I want.
finalcheck = finalcheck.withColumn('previous_last_pos' , F.last('last_positive_day_count', True).over(w2))\
.withColumn('previous_last_neg' , F.last('last_negative_day_count', True).over(w2))\
.withColumn('previous_last_pos_date' , F.last('last_positive_day', True).over(w2))\
.withColumn('previous_last_neg_date' , F.last('last_negative_day', True).over(w2))\
.withColumn('recency_last_positive' , F.datediff(F.col('EventDate'), F.col('previous_last_pos_date')))\
.withColumn('day_since_last_negative_v1' , F.datediff(F.col('EventDate'), F.col('previous_last_neg_date')))\
.withColumn('days_off' , F.datediff(F.col('EventDate'), F.col('previous_date_played')))\
.withColumn('recency_last_negative' , F.when((F.col('day_since_last_negative_v1').isNull()), F.col('days_off')).otherwise(F.col('day_since_last_negative_v1')))\
.withColumn('recency_last_negative_v2' , F.when((F.col('last_negative_day').isNull()), F.col('days_off')).otherwise(F.col('day_since_last_negative_v1')))\
.withColumn('recency_last_positive_v2' , F.when((F.col('last_positive_day').isNull()), F.col('days_off')).otherwise(F.col('recency_last_positive')))
Any suggestion/tips?
(I found a similar question but didn't figure out how to apply it to my specific case):
How to calculate days between when last condition was met?
Here is my attempt.
There are two parts to this calculation. The first is that when wins or losses keep repeating, the date differences should be summed. To achieve this, I mark the start of each consecutive run of losses or wins with 1 and split the rows into groups by cumulatively summing that marker up to the current row. Then I can calculate the cumulative days since the last loss or win across each consecutive run.
The second is that when the result flips between win and loss, I simply take the date difference between this match and the previous one, which is easily obtained from the current and previous rows.
Finally, merge those results into a single column.
from pyspark.sql.functions import lag, col, sum, when, expr, coalesce
from pyspark.sql import Window

w1 = Window.orderBy('Date')
w2 = Window.partitionBy('groupLossCheck').orderBy('Date')
w3 = Window.partitionBy('groupWinCheck').orderBy('Date')

df2 = df.withColumn('lastFlag', lag('flag', 1).over(w1)) \
    .withColumn('lastDate', lag('Date', 1).over(w1)) \
    .withColumn('dateDiff', expr('datediff(Date, lastDate)')) \
    .withColumn('consecutiveLoss', expr('if(flag = 1 or lastFlag = 1, 0, 1)')) \
    .withColumn('consecutiveWin',  expr('if(flag = -1 or lastFlag = -1, 0, 1)')) \
    .withColumn('groupLossCheck', sum('consecutiveLoss').over(w1)) \
    .withColumn('groupWinCheck',  sum('consecutiveWin').over(w1)) \
    .withColumn('daysLastLoss', sum(when((col('consecutiveLoss') == 0) & (col('groupLossCheck') != 0), col('dateDiff'))).over(w2)) \
    .withColumn('daysLastwin',  sum(when((col('consecutiveWin')  == 0) & (col('groupWinCheck')  != 0), col('dateDiff'))).over(w3)) \
    .withColumn('lastLoss', expr('if(lastFlag = -1, dateDiff, null)')) \
    .withColumn('lastWin',  expr('if(lastFlag = 1, dateDiff, null)')) \
    .withColumn('RecencyLastLoss', coalesce('lastLoss', 'daysLastLoss')) \
    .withColumn('RecencyLastWin',  coalesce('lastWin',  'daysLastwin')) \
    .orderBy('Date')

df2.show(11, False)
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
|user |Date |RealLoss|flag|lastFlag|lastDate |dateDiff|consecutiveLoss|consecutiveWin|groupLossCheck|groupWinCheck|daysLastLoss|daysLastwin|lastLoss|lastWin|RecencyLastLoss|RecencyLastWin|
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
|100364|2019-02-01|-16.5 |1 |null |null |null |0 |1 |0 |1 |null |null |null |null |null |null |
|100364|2019-02-02|73.5 |-1 |1 |2019-02-01|1 |0 |0 |0 |1 |null |1 |null |1 |null |1 |
|100364|2019-02-03|31.0 |-1 |-1 |2019-02-02|1 |1 |0 |1 |1 |null |2 |1 |null |1 |2 |
|100364|2019-02-09|-5.2 |1 |-1 |2019-02-03|6 |0 |0 |1 |1 |6 |8 |6 |null |6 |8 |
|100364|2019-02-10|-34.5 |1 |1 |2019-02-09|1 |0 |1 |1 |2 |7 |null |null |1 |7 |1 |
|100364|2019-02-13|-8.1 |1 |1 |2019-02-10|3 |0 |1 |1 |3 |10 |null |null |3 |10 |3 |
|100364|2019-02-18|5.68 |-1 |1 |2019-02-13|5 |0 |0 |1 |3 |15 |5 |null |5 |15 |5 |
|100364|2019-02-19|5.76 |-1 |-1 |2019-02-18|1 |1 |0 |2 |3 |null |6 |1 |null |1 |6 |
|100364|2019-02-20|9.12 |-1 |-1 |2019-02-19|1 |1 |0 |3 |3 |null |7 |1 |null |1 |7 |
|100364|2019-02-26|9.4 |-1 |-1 |2019-02-20|6 |1 |0 |4 |3 |null |13 |6 |null |6 |13 |
|100364|2019-02-27|-30.6 |1 |-1 |2019-02-26|1 |0 |0 |4 |3 |1 |14 |1 |null |1 |14 |
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
df2.select(*df.columns, 'RecencyLastLoss', 'RecencyLastWin').show(11, False)
+------+----------+--------+----+---------------+--------------+
|user |Date |RealLoss|flag|RecencyLastLoss|RecencyLastWin|
+------+----------+--------+----+---------------+--------------+
|100364|2019-02-01|-16.5 |1 |null |null |
|100364|2019-02-02|73.5 |-1 |null |1 |
|100364|2019-02-03|31.0 |-1 |1 |2 |
|100364|2019-02-09|-5.2 |1 |6 |8 |
|100364|2019-02-10|-34.5 |1 |7 |1 |
|100364|2019-02-13|-8.1 |1 |10 |3 |
|100364|2019-02-18|5.68 |-1 |15 |5 |
|100364|2019-02-19|5.76 |-1 |1 |6 |
|100364|2019-02-20|9.12 |-1 |1 |7 |
|100364|2019-02-26|9.4 |-1 |6 |13 |
|100364|2019-02-27|-30.6 |1 |1 |14 |
+------+----------+--------+----+---------------+--------------+
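For what it's worth, here is a shorter sketch (not from the answer above; it assumes the df with the user/Date/flag columns shown in the question): keep a running maximum of the most recent win date and the most recent loss date over all strictly earlier rows, then take the date difference. On the sample data this should reproduce the RecencyLastWin/RecencyLastLoss values above.

from pyspark.sql import functions as F, Window

# Only look at rows strictly before the current one, per user, in date order
w = Window.partitionBy("user").orderBy("Date").rowsBetween(Window.unboundedPreceding, -1)

result = (
    df.withColumn("Date", F.col("Date").cast("date"))  # in case Date is stored as a string
      .withColumn("lastWinDate",  F.max(F.when(F.col("flag") == 1,  F.col("Date"))).over(w))
      .withColumn("lastLossDate", F.max(F.when(F.col("flag") == -1, F.col("Date"))).over(w))
      .withColumn("RecencyLastWin",  F.datediff("Date", "lastWinDate"))
      .withColumn("RecencyLastLoss", F.datediff("Date", "lastLossDate"))
      .drop("lastWinDate", "lastLossDate")
)
result.show(11, False)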

Can we reorder spark dataframe's columns?

I am creating a DataFrame as per the given schema; after that I want to create a new DataFrame by reordering the columns of the existing one.
Is it possible to reorder the columns of a Spark DataFrame?
object Demo extends Context {
  def main(args: Array[String]): Unit = {
    val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
      (2,"Rose",1,"2010","20","M",4000),
      (3,"Williams",1,"2010","10","M",1000),
      (4,"Jones",2,"2005","10","F",2000),
      (5,"Brown",2,"2010","40","",-1),
      (6,"Brown",2,"2010","50","",-1)
    )
    val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
      "emp_dept_id","gender","salary")

    import sparkSession.sqlContext.implicits._
    val empDF = emp.toDF(empColumns: _*)
    empDF.show(false)
  }
}
Current DF:
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
I want the output to be the following DF, where the gender and salary columns are reordered:
New DF:
+------+--------+------+------+---------------+-----------+-----------+
|emp_id|name |gender|salary|superior_emp_id|year_joined|emp_dept_id|
+------+--------+------+------+---------------+-----------+-----------+
|1 |Smith |M |3000 |-1 |2018 |10 |
|2 |Rose |M |4000 |1 |2010 |20 |
|3 |Williams|M |1000 |1 |2010 |10 |
|4 |Jones |F |2000 |2 |2005 |10 |
|5 |Brown | |-1 |2 |2010 |40 |
|6 |Brown | |-1 |2 |2010 |50 |
+------+--------+------+------+---------------+-----------+-----------+
Just use select() to re-order the columns:
df = df.select('emp_id','name','gender','salary','superior_emp_id','year_joined','emp_dept_id')
The columns will be shown according to the order you pass to select().
The Scala way of doing it:
import org.apache.spark.sql.functions.col

// Order the column names as you want
val columns = Array("emp_id", "name", "gender", "salary", "superior_emp_id", "year_joined", "emp_dept_id")
  .map(col)

// Pass it to select
df.select(columns: _*)
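If the desired order isn't fixed, a small PySpark-style sketch (assuming a DataFrame named df; not from the original answers) is to list the columns you want in front and append whatever remains:

# Put the chosen columns first and keep every other column in its original order
front = ["emp_id", "name", "gender", "salary"]
rest = [c for c in df.columns if c not in front]
df = df.select(*front, *rest)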

Spark Window function has sliding window behavior when it is ordered

I have a dataset which looks like this:
+---+-------------------------------+--------+
|key|value |someData|
+---+-------------------------------+--------+
|1 |AAA |5 |
|1 |VVV |6 |
|1 |DDDD |8 |
|3 |rrerw |9 |
|4 |RRRRR |13 |
|6 |AAAAABB |15 |
|6 |C:\Windows\System32\svchost.exe|20 |
+---+-------------------------------+--------+
Now I apply the avg aggregate function twice, first over an ordered window and then over an unordered window, and the results are not the same. Example:
WindowSpec windowSpec = Window.orderBy(col("someData")).partitionBy(col("key"));
rawMapping.withColumn("avg", avg("someData").over(windowSpec)).show(false);
+---+-------------------------------+--------+-----------------+
|key|value |someData|avg |
+---+-------------------------------+--------+-----------------+
|1 |AAA |5 |5.0 |
|1 |VVV |6 |5.5 |
|1 |DDDD |8 |6.333333333333333|
|6 |AAAAABB |15 |15.0 |
|6 |C:\Windows\System32\svchost.exe|20 |17.5 |
|3 |rrerw |9 |9.0 |
|4 |RRRRR |13 |13.0 |
+---+-------------------------------+--------+-----------------+
WindowSpec windowSpec2 = Window.partitionBy(col("key"));
rawMapping.withColumn("avg", avg("someData").over(windowSpec2)).show(false);
+---+-------------------------------+--------+-----------------+
|key|value |someData|avg |
+---+-------------------------------+--------+-----------------+
|1 |AAA |5 |6.333333333333333|
|1 |VVV |6 |6.333333333333333|
|1 |DDDD |8 |6.333333333333333|
|6 |AAAAABB |15 |17.5 |
|6 |C:\Windows\System32\svchost.exe|20 |17.5 |
|3 |rrerw |9 |9.0 |
|4 |RRRRR |13 |13.0 |
+---+-------------------------------+--------+-----------------+
When the window is ordered, the aggregate function has a "sliding window" behavior. Why is this happening? And more importantly, is it a bug or a feature?
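For context (this is not part of the original post, but it is documented Spark behavior rather than a bug): when a window specification has an ORDER BY but no explicit frame, Spark defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which produces the running averages above; when there is no ORDER BY, the default frame is the whole partition. A small PySpark sketch of how to keep the ordering and still average over the entire partition (assuming the same rawMapping DataFrame):

from pyspark.sql import functions as F, Window

# Ordered window, but with an explicit full-partition frame
w = (
    Window.partitionBy("key")
          .orderBy("someData")
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)
rawMapping.withColumn("avg", F.avg("someData").over(w)).show(truncate=False)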
