Exploding an array into 2 columns - apache-spark

Suppose we want to track the hops made by a package from the warehouse to the customer.
We have a table that stores the data, but the whole route is held in a single column, say Route.
The package starts at a warehouse – YYY, TTT or MMM.
The hops end when the package is delivered to the CUSTOMER.
The values in the Route column are separated by spaces.
ID Route
1 TTT A B X Y Z CUSTOMER
2 YYY E Y F G I P B X Q CUSTOMER
3 MMM R T K L CUSTOMER
Expected Output
ID START END
1 TTT A
1 A B
1 B X
.
.
.
1 Z CUSTOMER
2 YYY E
2 E Y
2 Y F
.
.
2 Q CUSTOMER
3 MMM R
.
.
3 L CUSTOMER
Is there any way to achieve this in PySpark?

Add an index to the split route using posexplode, then get the location at the next index for each starting location using lead. If you want to remove the index, simply add .drop('index') at the end.
import pyspark.sql.functions as F
from pyspark.sql.window import Window

df2 = df.select(
    'ID',
    # split the route on spaces and explode it together with its position
    F.posexplode(F.split('Route', ' ')).alias('index', 'start')
).withColumn(
    'end',
    # the end of each hop is the start of the next hop on the same route
    F.lead('start').over(Window.partitionBy('ID').orderBy('index'))
).orderBy('ID', 'index').dropna()

df2.show(99, False)
+---+-----+-----+--------+
|ID |index|start|end |
+---+-----+-----+--------+
|1 |0 |TTT |A |
|1 |1 |A |B |
|1 |2 |B |X |
|1 |3 |X |Y |
|1 |4 |Y |Z |
|1 |5 |Z |CUSTOMER|
|2 |0 |YYY |E |
|2 |1 |E |Y |
|2 |2 |Y |F |
|2 |3 |F |G |
|2 |4 |G |I |
|2 |5 |I |P |
|2 |6 |P |B |
|2 |7 |B |X |
|2 |8 |X |Q |
|2 |9 |Q |CUSTOMER|
|3 |0 |MMM |R |
|3 |1 |R |T |
|3 |2 |T |K |
|3 |3 |K |L |
|3 |4 |L |CUSTOMER|
+---+-----+-----+--------+
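For reference, the input df used in the snippet above can be built with a minimal sketch like this (the ID and Route column names are taken from the question):
data = [
    (1, "TTT A B X Y Z CUSTOMER"),
    (2, "YYY E Y F G I P B X Q CUSTOMER"),
    (3, "MMM R T K L CUSTOMER"),
]
df = spark.createDataFrame(data, ["ID", "Route"])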

Related

Look back based on X days and get col values based on condition spark

I have the following DF:
--------------------------------
|Id |Date |Value |cond |
|-------------------------------|
|1 |2022-08-03 | 100| 1 |
|1 |2022-08-04 | 200| 2 |
|1 |2022-08-05 | 150| 3 |
|1 |2022-08-06 | 300| 4 |
|1 |2022-08-07 | 400| 5 |
|1 |2022-08-08 | 150| 6 |
|1 |2022-08-09 | 500| 7 |
|1 |2022-08-10 | 150| 8 |
|1 |2022-08-11 | 150| 9 |
|1 |2022-08-12 | 700| 1 |
|1 |2022-08-13 | 800| 2 |
|1 |2022-08-14 | 150| 2 |
|1 |2022-08-15 | 300| 0 |
|1 |2022-08-16 | 200| 1 |
|1 |2022-08-17 | 150| 3 |
|1 |2022-08-18 | 150| 1 |
|1 |2022-08-19 | 250| 4 |
|1 |2022-08-20 | 150| 5 |
|1 |2022-08-21 | 400| 6 |
|2 |2022-08-03 | 100| 1 |
|2 |2022-08-04 | 200| 2 |
|2 |2022-08-05 | 150| 1 |
|2 |2022-08-06 | 300| 1 |
|2 |2022-08-07 | 400| 1 |
|2 |2022-08-08 | 150| 1 |
|2 |2022-08-09 | 125| 1 |
|2 |2022-08-10 | 150| 1 |
|2 |2022-08-11 | 150| 3 |
|2 |2022-08-12 | 170| 6 |
|2 |2022-08-13 | 150| 7 |
|2 |2022-08-14 | 150| 8 |
|2 |2022-08-15 | 300| 1 |
|2 |2022-08-16 | 150| 9 |
|2 |2022-08-17 | 150| 0 |
|2 |2022-08-18 | 400| 1 |
|2 |2022-08-19 | 150| 1 |
|2 |2022-08-20 | 500| 1 |
|2 |2022-08-21 | 150| 1 |
--------------------------------
And this one:
---------------------
|Date | cond |
|-------------------|
|2022-08-03 | 1 |
|2022-08-04 | 2 |
|2022-08-05 | 1 |
|2022-08-06 | 1 |
|2022-08-07 | 1 |
|2022-08-08 | 1 |
|2022-08-09 | 1 |
|2022-08-10 | 1 |
|2022-08-11 | 3 |
|2022-08-12 | 6 |
|2022-08-13 | 8 |
|2022-08-14 | 9 |
|2022-08-15 | 1 |
|2022-08-16 | 2 |
|2022-08-17 | 2 |
|2022-08-18 | 0 |
|2022-08-19 | 1 |
|2022-08-20 | 3 |
|2022-08-21 | 1 |
--------------------
My expected output is:
-------------------------------
|Id |Date |Avg |Count|
|-----------------------------|
|1 |2022-08-03 | 0| 0 |
|1 |2022-08-04 | 0| 0 |
|1 |2022-08-05 | 0| 0 |
|1 |2022-08-06 | 0| 0 |
|1 |2022-08-07 | 0| 0 |
|1 |2022-08-08 | 0| 0 |
|1 |2022-08-09 | 0| 0 |
|1 |2022-08-10 | 0| 0 |
|1 |2022-08-11 | 0| 0 |
|1 |2022-08-12 | 0| 0 |
|1 |2022-08-13 | 0| 0 |
|1 |2022-08-14 | 0| 0 |
|1 |2022-08-15 | 0| 0 |
|1 |2022-08-16 | 0| 0 |
|1 |2022-08-17 | 0| 0 |
|1 |2022-08-18 | 0| 0 |
|1 |2022-08-19 | 0| 0 |
|1 |2022-08-20 | 0| 0 |
|1 |2022-08-21 | 0| 0 |
|2 |2022-08-03 | 0| 0 |
|2 |2022-08-04 | 0| 0 |
|2 |2022-08-05 | 0| 1 |
|2 |2022-08-06 | 0| 2 |
|2 |2022-08-07 | 0| 3 |
|2 |2022-08-08 | 237.5| 4 |
|2 |2022-08-09 | 250| 4 |
|2 |2022-08-10 |243.75| 4 |
|2 |2022-08-11 | 0| 0 |
|2 |2022-08-12 | 0| 0 |
|2 |2022-08-13 | 0| 0 |
|2 |2022-08-14 | 0| 0 |
|2 |2022-08-15 |206.25| 4 |
|2 |2022-08-16 | 0| 0 |
|2 |2022-08-17 | 0| 0 |
|2 |2022-08-18 | 0| 0 |
|2 |2022-08-19 |243.75| 4 |
|2 |2022-08-20 | 0| 0 |
|2 |2022-08-21 | 337.5| 4 |
-------------------------------
The algorithm is:
Verify if Date and Cond are the same in the first and second DFs.
If the condition is true, I need to look back on DF1 four days (D-1, D-2, D-3, D-4) based on Cond and calculate the average (Avg) and count of those values. If I have more than 4 days, I need to use the top 4 values to calculate the Avg, and Count is always 4 in that case.
Example situations based on the inputs:
Id = 1, Date = 2022-08-08
Count is 0 because the condition is false, so Avg is 0 too.
Id = 2, Date = 2022-08-08
Count is 4 because the condition is true, so I take the values of 2022-08-07, 2022-08-06, 2022-08-05 and 2022-08-03. I exclude 2022-08-04 because its Cond value is 2, while the Cond of the reference date is 1.
Id = 2, Date = 2022-08-07
Count is 3 because the condition is true, but there are only 3 values before that date. I can't calculate the Avg since I need four values, so in that case Avg is zero.
I tried to use window functions, but with no success. I was able to achieve the output DF using SQL (joins with OUTER APPLY), but Spark doesn't have OUTER APPLY. So my questions are:
How can I generate the output DF?
What is the best way to generate the output DF?
MVCE to generate the input DFs in PySpark:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data_1 = [
("1","2022-08-03",100,1),
("1","2022-08-04",200,2),
("1","2022-08-05",150,3),
("1","2022-08-06",300,4),
("1","2022-08-07",400,5),
("1","2022-08-08",150,6),
("1","2022-08-09",500,7),
("1","2022-08-10",150,8),
("1","2022-08-11",150,9),
("1","2022-08-12",700,1),
("1","2022-08-13",800,2),
("1","2022-08-14",150,2),
("1","2022-08-15",300,0),
("1","2022-08-16",200,1),
("1","2022-08-17",150,3),
("1","2022-08-18",150,1),
("1","2022-08-19",250,4),
("1","2022-08-20",150,5),
("1","2022-08-21",400,6),
("2","2022-08-03",100,1),
("2","2022-08-04",200,2),
("2","2022-08-05",150,1),
("2","2022-08-06",300,1),
("2","2022-08-07",400,1),
("2","2022-08-08",150,1),
("2","2022-08-09",125,1),
("2","2022-08-10",150,1),
("2","2022-08-11",150,3),
("2","2022-08-12",170,6),
("2","2022-08-13",150,7),
("2","2022-08-14",150,8),
("2","2022-08-15",300,1),
("2","2022-08-16",150,9),
("2","2022-08-17",150,0),
("2","2022-08-18",400,1),
("2","2022-08-19",150,1),
("2","2022-08-20",500,1),
("2","2022-08-21",150,1)
]
schema_1 = StructType([
    StructField("Id", StringType(), True),
    StructField("Date", StringType(), True),
    StructField("Value", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])
df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
data_2 = [
("2022-08-03", 1),
("2022-08-04", 2),
("2022-08-05", 1),
("2022-08-06", 1),
("2022-08-07", 1),
("2022-08-08", 1),
("2022-08-09", 1),
("2022-08-10", 1),
("2022-08-11", 3),
("2022-08-12", 6),
("2022-08-13", 8),
("2022-08-14", 9),
("2022-08-15", 1),
("2022-08-16", 2),
("2022-08-17", 2),
("2022-08-18", 0),
("2022-08-19", 1),
("2022-08-20", 3),
("2022-08-21", 1)
]
schema_2 = StructType([
    StructField("Date", StringType(), True),
    StructField("Cond", IntegerType(), True)
])
df_2 = spark.createDataFrame(data=data_2, schema=schema_2)
UPDATE: I updated the question to be clearer about the conditions for joining the DFs!
Do a left join to get the dates you are interested in.
Then use a pyspark.sql.Window to collect the values you need into a list and take its size as Count.
Finally, with the help of pyspark.sql.functions.aggregate, compute the Avg.
from pyspark.sql import functions as F, Window

# cast to date, and rename columns for later use
df_1 = df_1.withColumn("Date", F.col("Date").cast("date"))
df_2 = df_2.withColumn("Date", F.col("Date").cast("date"))
df_2 = df_2.withColumnRenamed("Date", "DateDf2") \
           .withColumnRenamed("Cond", "CondDf2")

# left join
df = df_1.join(df_2, (df_1.Cond == df_2.CondDf2) & (df_1.Date == df_2.DateDf2), how='left')

windowSpec = Window.partitionBy("Id", "Cond").orderBy("Date")

# all the magic happens here!
df = (
    # only start counting when "DateDf2" is not null, and put the values into a list
    df.withColumn("value_list", F.when(F.isnull("DateDf2"), F.array()).otherwise(F.collect_list("Value").over(windowSpec.rowsBetween(-4, -1))))
    .withColumn("Count", F.size("value_list"))
    # use aggregate to sum up the list only if its size is 4, and divide by 4 to get the average
    .withColumn("Avg", F.when(F.col("Count") == 4, F.aggregate("value_list", F.lit(0), lambda acc, x: acc + x) / 4).otherwise(F.lit(0)))
    .select("Id", "Date", "Avg", "Count")
    .orderBy("Id", "Date")
)
Output is:
+---+----------+------+-----+
|Id |Date |Avg |Count|
+---+----------+------+-----+
|1 |2022-08-03|0.0 |0 |
|1 |2022-08-04|0.0 |0 |
|1 |2022-08-05|0.0 |0 |
|1 |2022-08-06|0.0 |0 |
|1 |2022-08-07|0.0 |0 |
|1 |2022-08-08|0.0 |0 |
|1 |2022-08-09|0.0 |0 |
|1 |2022-08-10|0.0 |0 |
|1 |2022-08-11|0.0 |0 |
|1 |2022-08-12|0.0 |0 |
|1 |2022-08-13|0.0 |0 |
|1 |2022-08-14|0.0 |0 |
|1 |2022-08-15|0.0 |0 |
|1 |2022-08-16|0.0 |0 |
|1 |2022-08-17|0.0 |0 |
|1 |2022-08-18|0.0 |0 |
|1 |2022-08-19|0.0 |0 |
|1 |2022-08-20|0.0 |0 |
|1 |2022-08-21|0.0 |0 |
|2 |2022-08-03|0.0 |0 |
|2 |2022-08-04|0.0 |0 |
|2 |2022-08-05|0.0 |1 |
|2 |2022-08-06|0.0 |2 |
|2 |2022-08-07|0.0 |3 |
|2 |2022-08-08|237.5 |4 |
|2 |2022-08-09|250.0 |4 |
|2 |2022-08-10|243.75|4 |
|2 |2022-08-11|0.0 |0 |
|2 |2022-08-12|0.0 |0 |
|2 |2022-08-13|0.0 |0 |
|2 |2022-08-14|0.0 |0 |
|2 |2022-08-15|206.25|4 |
|2 |2022-08-16|0.0 |0 |
|2 |2022-08-17|0.0 |0 |
|2 |2022-08-18|0.0 |0 |
|2 |2022-08-19|243.75|4 |
|2 |2022-08-20|0.0 |0 |
|2 |2022-08-21|337.5 |4 |
+---+----------+------+-----+
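The F.aggregate call above is the least familiar step: it folds the collected array into a sum, which is then divided by 4. A tiny standalone sketch of just that piece (the demo DataFrame and its values column are made up for illustration):
from pyspark.sql import functions as F

demo = spark.createDataFrame([([100, 200, 300, 400],)], "values array<int>")
demo.select(
    # fold the array into a sum starting from 0, then divide by 4
    (F.aggregate("values", F.lit(0), lambda acc, x: acc + x) / 4).alias("avg")
).show()  # prints 250.0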
Here is another solution for the same problem:
from pyspark.sql import Window
import pyspark.sql.functions as F

df_1 = df_1.withColumn("Date", F.col("Date").cast("timestamp"))
df_2 = df_2.withColumn("Date", F.col("Date").cast("timestamp"))

window_spec = Window.partitionBy(["Id"]).orderBy("Date")
four_days_sld_wnd_excl_current_row = Window.partitionBy(["Id"]).orderBy(["rnk"]).rangeBetween(-4, -1)
window_spec_count_cond_ = Window.partitionBy(["Id"]).orderBy(F.unix_timestamp("Date", 'yyyy-MM-dd') / 86400).rangeBetween(-4, -1)

agg_col_cond_ = (F.col("agg") == 0.0)
date_2_col_cond_ = (F.col("Date_2").isNull())

valid_4_days_agg_value = (F.when((~date_2_col_cond_) & (F.size(F.col("date_arrays_with_cond_1")) == 4),
                                 F.sum(F.col("Value")).over(four_days_sld_wnd_excl_current_row)).otherwise(F.lit(0.0)))
count_cond_ = (F.when(~agg_col_cond_ & ~date_2_col_cond_, F.lit(4))
                .when(agg_col_cond_ & date_2_col_cond_, F.lit(0))
                .otherwise(F.size(F.collect_set(F.col("Date_2")).over(window_spec_count_cond_))))

df_jn = df_1.join(df_2, ["Date", "Cond"], "left") \
            .select(df_1["*"], df_2["Date"].alias("Date_2")).orderBy("Id", df_1["Date"])

filter_having_cond_1 = (F.col("Cond") == 1)
cond_columns_matching = (F.col("Date_2").isNull())

df_fnl_with_cond_val_1 = df_jn.filter(filter_having_cond_1)
df_fnl_with_cond_val_other = df_jn.filter(~filter_having_cond_1) \
    .withColumn("agg", F.lit(0.0)) \
    .withColumn("count", F.lit(0)) \
    .drop("Date_2")

df_fnl_with_cond_val_1 = df_fnl_with_cond_val_1 \
    .withColumn("rnk", F.row_number().over(window_spec)) \
    .withColumn("date_arrays_with_cond_1", F.collect_set(F.col("Date")).over(four_days_sld_wnd_excl_current_row)) \
    .withColumn("agg", valid_4_days_agg_value / 4) \
    .withColumn("count", count_cond_) \
    .drop("date_arrays_with_cond_1", "rnk", "Date_2")

df_fnl = df_fnl_with_cond_val_1.unionByName(df_fnl_with_cond_val_other)
df_fnl.orderBy(["Id", "Date"]).show(50, 0)
Kindly upvote if you like my solution.
Output:
+---+-------------------+-----+----+------+-----+
|Id |Date |Value|Cond|agg |count|
+---+-------------------+-----+----+------+-----+
|1 |2022-08-03 00:00:00|100 |1 |0.0 |0 |
|1 |2022-08-04 00:00:00|200 |2 |0.0 |0 |
|1 |2022-08-05 00:00:00|150 |3 |0.0 |0 |
|1 |2022-08-06 00:00:00|300 |4 |0.0 |0 |
|1 |2022-08-07 00:00:00|400 |5 |0.0 |0 |
|1 |2022-08-08 00:00:00|150 |6 |0.0 |0 |
|1 |2022-08-09 00:00:00|500 |7 |0.0 |0 |
|1 |2022-08-10 00:00:00|150 |8 |0.0 |0 |
|1 |2022-08-11 00:00:00|150 |9 |0.0 |0 |
|1 |2022-08-12 00:00:00|700 |1 |0.0 |0 |
|1 |2022-08-13 00:00:00|800 |2 |0.0 |0 |
|1 |2022-08-14 00:00:00|150 |2 |0.0 |0 |
|1 |2022-08-15 00:00:00|300 |0 |0.0 |0 |
|1 |2022-08-16 00:00:00|200 |1 |0.0 |0 |
|1 |2022-08-17 00:00:00|150 |3 |0.0 |0 |
|1 |2022-08-18 00:00:00|150 |1 |0.0 |0 |
|1 |2022-08-19 00:00:00|250 |4 |0.0 |0 |
|1 |2022-08-20 00:00:00|150 |5 |0.0 |0 |
|1 |2022-08-21 00:00:00|400 |6 |0.0 |0 |
|2 |2022-08-03 00:00:00|100 |1 |0.0 |0 |
|2 |2022-08-04 00:00:00|200 |2 |0.0 |0 |
|2 |2022-08-05 00:00:00|150 |1 |0.0 |1 |
|2 |2022-08-06 00:00:00|300 |1 |0.0 |2 |
|2 |2022-08-07 00:00:00|400 |1 |0.0 |3 |
|2 |2022-08-08 00:00:00|150 |1 |237.5 |4 |
|2 |2022-08-09 00:00:00|125 |1 |250.0 |4 |
|2 |2022-08-10 00:00:00|150 |1 |243.75|4 |
|2 |2022-08-11 00:00:00|150 |3 |0.0 |0 |
|2 |2022-08-12 00:00:00|170 |6 |0.0 |0 |
|2 |2022-08-13 00:00:00|150 |7 |0.0 |0 |
|2 |2022-08-14 00:00:00|150 |8 |0.0 |0 |
|2 |2022-08-15 00:00:00|300 |1 |206.25|4 |
|2 |2022-08-16 00:00:00|150 |9 |0.0 |0 |
|2 |2022-08-17 00:00:00|150 |0 |0.0 |0 |
|2 |2022-08-18 00:00:00|400 |1 |0.0 |0 |
|2 |2022-08-19 00:00:00|150 |1 |243.75|4 |
|2 |2022-08-20 00:00:00|500 |1 |0.0 |0 |
|2 |2022-08-21 00:00:00|150 |1 |337.5 |4 |
+---+-------------------+-----+----+------+-----+

Adding with Window Functions, from specific value

I am struggling with a (Py)Spark problem.
I have a column "col" in an ordered dataframe and need a way of adding up the elements, with the running sum starting over after each run of zeros. What I need is the column "sum_from_0" below.
I tried it with window functions but did not succeed.
Any idea on how to solve this task would be appreciated.
Thank you in advance.
col sum_from_0
0 None
0 None
1 1
2 3
1 4
4 8
3 11
0 None
0 None
0 None
1 1
2 3
3 6
3 9
2 11
0 None
0 None
There is no ordering column, so I create one first and add some temporary columns to separate the sum groups. After that, I sum over a window partitioned by group and ordered by id, like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w1 = Window.orderBy("id")
val w2 = Window.partitionBy("group").orderBy("id")

df.withColumn("id", monotonically_increasing_id)
  .withColumn("zero", (col("col") === 0).cast("int"))
  .withColumn("group", sum("zero").over(w1))
  .withColumn("sum_from_0", sum("col").over(w2))
  .orderBy("id")
  .drop("id", "group", "zero")
  .show(20, false)
that gives the results:
+---+----------+
|col|sum_from_0|
+---+----------+
|0 |0 |
|0 |0 |
|1 |1 |
|2 |3 |
|1 |4 |
|4 |8 |
|3 |11 |
|0 |0 |
|0 |0 |
|0 |0 |
|1 |1 |
|2 |3 |
|3 |6 |
|3 |9 |
|2 |11 |
|0 |0 |
|0 |0 |
+---+----------+
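For the PySpark side of the question, a minimal sketch translating the Scala answer above (assuming a DataFrame df with the single column col; like the Scala output, it yields 0 rather than None for the zero rows):
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w1 = Window.orderBy("id")
w2 = Window.partitionBy("group").orderBy("id")

result = (
    df.withColumn("id", F.monotonically_increasing_id())    # preserve the original row order
      .withColumn("zero", (F.col("col") == 0).cast("int"))  # flag the zero rows
      .withColumn("group", F.sum("zero").over(w1))          # running count of zeros = group label
      .withColumn("sum_from_0", F.sum("col").over(w2))      # cumulative sum within each group
      .orderBy("id")
      .drop("id", "group", "zero")
)
result.show(20, False)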

Can we reorder spark dataframe's columns?

I am creating a dataframe as per a given schema; after that, I want to create a new dataframe by reordering the columns of the existing dataframe.
Is it possible to reorder the columns of a Spark dataframe?
object Demo extends Context {
  def main(args: Array[String]): Unit = {
    val emp = Seq(
      (1, "Smith", -1, "2018", "10", "M", 3000),
      (2, "Rose", 1, "2010", "20", "M", 4000),
      (3, "Williams", 1, "2010", "10", "M", 1000),
      (4, "Jones", 2, "2005", "10", "F", 2000),
      (5, "Brown", 2, "2010", "40", "", -1),
      (6, "Brown", 2, "2010", "50", "", -1)
    )
    val empColumns = Seq("emp_id", "name", "superior_emp_id", "year_joined",
      "emp_dept_id", "gender", "salary")

    import sparkSession.sqlContext.implicits._
    val empDF = emp.toDF(empColumns: _*)
    empDF.show(false)
  }
}
Current DF:
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
I want the output to be the following df, where the gender and salary columns are re-ordered:
New DF:
+------+--------+------+------+---------------+-----------+-----------+
|emp_id|name |gender|salary|superior_emp_id|year_joined|emp_dept_id|
+------+--------+------+------+---------------+-----------+-----------+
|1 |Smith |M |3000 |-1 |2018 |10 |
|2 |Rose |M |4000 |1 |2010 |20 |
|3 |Williams|M |1000 |1 |2010 |10 |
|4 |Jones |F |2000 |2 |2005 |10 |
|5 |Brown | |-1 |2 |2010 |40 |
|6 |Brown | |-1 |2 |2010 |50 |
+------+--------+------+------+---------------+-----------+-----------+
Just use select() to re-order the columns:
df = df.select('emp_id', 'name', 'gender', 'salary', 'superior_emp_id', 'year_joined', 'emp_dept_id')
The columns will appear in the order you give them in the select() arguments.
Scala way of doing it:
import org.apache.spark.sql.functions.col

// Order the column names as you want
val columns = Array("emp_id", "name", "gender", "salary", "superior_emp_id", "year_joined", "emp_dept_id")
  .map(col)

// Pass it to select
df.select(columns: _*)
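If you would rather not type out every column, a small PySpark sketch that derives the new order from the existing columns (empDF and the column names are taken from the question; the front list is just an assumption about which columns should lead):
# Move "gender" and "salary" directly after "name"; keep the remaining columns in their original order.
front = ["emp_id", "name", "gender", "salary"]
rest = [c for c in empDF.columns if c not in front]

reordered = empDF.select(*(front + rest))
reordered.show(truncate=False)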

finding non-overlapping windows in a pyspark dataframe

Suppose I have a pyspark dataframe with an id column and a time column (t) in seconds. For each id I'd like to group the rows so that each group has all entries that are within 5 seconds after the start time for that group. So for instance, if the table is:
+---+--+
|id |t |
+---+--+
|1 |0 |
|1 |1 |
|1 |3 |
|1 |8 |
|1 |14|
|1 |18|
|2 |0 |
|2 |20|
|2 |21|
|2 |50|
+---+--+
Then the result should be:
+---+--+---------+-------------+-------+
|id |t |subgroup |window_start |offset |
+---+--+---------+-------------+-------+
|1 |0 |1 |0 |0 |
|1 |1 |1 |0 |1 |
|1 |3 |1 |0 |3 |
|1 |8 |2 |8 |0 |
|1 |14|3 |14 |0 |
|1 |18|3 |14 |4 |
|2 |0 |1 |0 |0 |
|2 |20|2 |20 |0 |
|2 |21|2 |20 |1 |
|2 |50|3 |50 |0 |
+---+--+---------+-------------+-------+
I don't need the subgroup numbers to be consecutive. I'm OK with solutions using a custom UDAF in Scala as long as they are efficient.
Computing (cumsum(t)-(cumsum(t)%5))/5 within each group can be used to identify the first window, but not the ones beyond that. Essentially the problem is that after the first window is found, the cumulative sum needs to reset to 0. I could operate recursively using this cumulative sum approach, but that is too inefficient on a large dataset.
The following works and is more efficient than recursively calling cumsum, but it is still so slow as to be useless on large dataframes.
import numpy
import pyspark.sql.functions
import pyspark.sql.types

d = [[int(x[0]), float(x[1])] for x in [[1,0],[1,1],[1,4],[1,7],[1,14],[1,18],[2,5],[2,20],[2,21],[3,0],[3,1],[3,1.5],[3,2],[3,3.5],[3,4],[3,6],[3,6.5],[3,7],[3,11],[3,14],[3,18],[3,20],[3,24],[4,0],[4,1],[4,2],[4,6],[4,7]]]
schema = pyspark.sql.types.StructType(
    [
        pyspark.sql.types.StructField('id', pyspark.sql.types.LongType(), False),
        pyspark.sql.types.StructField('t', pyspark.sql.types.DoubleType(), False)
    ]
)
df = spark.createDataFrame(
    [pyspark.sql.Row(*x) for x in d],
    schema
)

def getSubgroup(ts):
    result = []
    total = 0
    ts = sorted(ts)
    tdiffs = numpy.array(ts)
    tdiffs = tdiffs[1:] - tdiffs[:-1]
    tdiffs = numpy.concatenate([[0], tdiffs])
    subgroup = 0
    for k in range(len(tdiffs)):
        t = ts[k]
        tdiff = tdiffs[k]
        total = total + tdiff
        if total >= 5:
            total = 0
            subgroup += 1
        result.append([t, float(subgroup)])
    return result

getSubgroupUDF = pyspark.sql.functions.udf(getSubgroup, pyspark.sql.types.ArrayType(pyspark.sql.types.ArrayType(pyspark.sql.types.DoubleType())))

subgroups = df.select('id', 't').distinct().groupBy(
    'id'
).agg(
    pyspark.sql.functions.collect_list('t').alias('ts')
).withColumn(
    't_and_subgroup',
    pyspark.sql.functions.explode(getSubgroupUDF('ts'))
).withColumn(
    't',
    pyspark.sql.functions.col('t_and_subgroup').getItem(0)
).withColumn(
    'subgroup',
    pyspark.sql.functions.col('t_and_subgroup').getItem(1).cast(pyspark.sql.types.IntegerType())
).drop(
    't_and_subgroup', 'ts'
)

df = df.join(
    subgroups,
    on=['id', 't'],
    how='inner'
)
df.orderBy(
    pyspark.sql.functions.asc('id'), pyspark.sql.functions.asc('t')
).show()
The subgroup column is equivalent to partitioning by (id, window_start), so maybe you don't need to create it.
To create window_start, I think this does the job:
.withColumn("window_start", min("t").over(Window.partitionBy("id").orderBy(asc("t")).rangeBetween(0, 5)))
I'm not 100% sure about the behavior of rangeBetween.
To create offset it's just:
.withColumn("offset", col("t") - col("window_start"))
Let me know how it goes.

Create a new column with filter

I want to create a new column that contains the count of rows in the dataframe that match a filter.
Here is an example:
+---------------------------------------+
|conditions |
+---------------------------------------+
|* |
|* |
|p1==1 AND p2==1 |
I tried:
df = df.withColumn('cardinal',df.filter(conditions).count())
it didn't work. The error message is:
"filter expression 'conditions' of type string is not a boolean.;;\nFilter conditions#2043: string\n+-
You have to wrap the count in a literal: withColumn expects a Column, and count() returns a plain number, so pass it through lit().
Try the syntax below:
>>> df1 = df.withColumn('cardinal', lit(df.filter(conditions).count()))
Now the df1 dataframe will have the cardinal column added to it.
Update:
I tried with a simple example:
import pyspark.sql.functions as F

df = sc.parallelize([(1, 1), (2, 1), (3, 2)]).toDF(["p1", "p2"])  # create DataFrame
conditions = ((F.col('p1') == 1) & (F.col('p2') == 1))  # define the conditions variable
df1 = df.withColumn("cardinal", F.lit(df.filter(conditions).count()))  # add the column
df1.show(10, False)
+---+---+--------+
|p1 |p2 |cardinal|
+---+---+--------+
|1 |1 |1 |
|2 |1 |1 |
|3 |2 |1 |
+---+---+--------+
(or)
Without using the conditions variable:
df1 = df.withColumn("cardinal", F.lit(df.filter((F.col('p1') == 1) & (F.col('p2') == 1)).count()))
df1.show(10,False)
+---+---+--------+
|p1 |p2 |cardinal|
+---+---+--------+
|1 |1 |1 |
|2 |1 |1 |
|3 |2 |1 |
+---+---+--------+
(or)
Using the .where clause:
df1 = df.withColumn("cardinal", F.lit(df.where((F.col("p1") == 1) & (F.col("p2") == 1)).count()))
df1.show(10,False)
+---+---+--------+
|p1 |p2 |cardinal|
+---+---+--------+
|1 |1 |1 |
|2 |1 |1 |
|3 |2 |1 |
+---+---+--------+
