We have a DataFrame like the one below:
+------+--------------------+
| Flag | value|
+------+--------------------+
|1 |5 |
|1 |4 |
|1 |3 |
|1 |5 |
|1 |6 |
|1 |4 |
|1 |7 |
|1 |5 |
|1 |2 |
|1 |3 |
|1 |2 |
|1 |6 |
|1 |9 |
+------+--------------------+
After a normal cumulative sum we get this:
+------+--------------------+----------+
| Flag | value|cumsum |
+------+--------------------+----------+
|1 |5 |5 |
|1 |4 |9 |
|1 |3 |12 |
|1 |5 |17 |
|1 |6 |23 |
|1 |4 |27 |
|1 |7 |34 |
|1 |5 |39 |
|1 |2 |41 |
|1 |3 |44 |
|1 |2 |46 |
|1 |6 |52 |
|1 |9 |61 |
+------+--------------------+----------+
Now we want the cumulative sum to reset when a specific condition is met, for example once it crosses 20.
The expected output is below:
+------+--------------------+----------+---------+
| Flag | value|cumsum |expected |
+------+--------------------+----------+---------+
|1 |5 |5 |5 |
|1 |4 |9 |9 |
|1 |3 |12 |12 |
|1 |5 |17 |17 |
|1 |6 |23 |23 |
|1 |4 |27 |4 | <-----reset
|1 |7 |34 |11 |
|1 |5 |39 |16 |
|1 |2 |41 |18 |
|1 |3 |44 |21 |
|1 |2 |46 |2 | <-----reset
|1 |6 |52 |8 |
|1 |9 |61 |17 |
+------+--------------------+----------+---------+
This is how we are calculating the cumulative sum.
win_counter = Window.partitionBy("flag")
df_partitioned = df_partitioned.withColumn('cumsum',F.sum(F.col('value')).over(win_counter))
I've found two ways to solve it without a UDF:
Dataframe
from pyspark.sql.window import Window
import pyspark.sql.functions as f
df = spark.createDataFrame([
(1, 5), (1, 4), (1, 3), (1, 5), (1, 6), (1, 4),
(1, 7), (1, 5), (1, 2), (1, 3), (1, 2), (1, 6), (1, 9)
], schema='Flag int, value int')
w = (Window
.partitionBy('flag')
.orderBy(f.monotonically_increasing_id())
.rowsBetween(Window.unboundedPreceding, Window.currentRow))
df = df.withColumn('values', f.collect_list('value').over(w))
expr = "AGGREGATE(values, 0, (acc, el) -> IF(acc < 20, acc + el, el))"
df = df.select('Flag', 'value', f.expr(expr).alias('cumsum'))
df.show(truncate=False)
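To see why the AGGREGATE expression produces a resetting sum, here is the same fold in plain Python, with no Spark involved. It is only a sketch of the logic: `fold` is a made-up helper name, and the list slices stand in for what collect_list returns over the unbounded-preceding window.

```python
from functools import reduce

values = [5, 4, 3, 5, 6, 4, 7, 5, 2, 3, 2, 6, 9]

def fold(prefix):
    # Same rule as the SQL lambda: keep adding while acc < 20, otherwise restart at el.
    return reduce(lambda acc, el: acc + el if acc < 20 else el, prefix, 0)

# For row i, the window's collect_list holds values[0..i]; fold that prefix from scratch.
cumsum = [fold(values[: i + 1]) for i in range(len(values))]
print(cumsum)
```

Note that every row folds its whole prefix again, so the work grows quadratically with the number of rows in a partition.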
RDD
df = spark.createDataFrame([
(1, 5), (1, 4), (1, 3), (1, 5), (1, 6), (1, 4),
(1, 7), (1, 5), (1, 2), (1, 3), (1, 2), (1, 6), (1, 9)
], schema='Flag int, value int')
def cumsum_by_flag(rows):
    cumsum, reset = 0, False
    for row in rows:
        if reset:
            cumsum = row.value
            reset = False
        else:
            cumsum += row.value
            reset = cumsum > 20
        yield row.value, cumsum
def unpack(value):
    flag = value[0]
    value, cumsum = value[1]
    return flag, value, cumsum
rdd = df.rdd.keyBy(lambda row: row.Flag)
rdd = (rdd
.groupByKey()
.flatMapValues(cumsum_by_flag)
.map(unpack))
df = rdd.toDF('Flag int, value int, cumsum int')
df.show(truncate=False)
Output:
+----+-----+------+
|Flag|value|cumsum|
+----+-----+------+
|1 |5 |5 |
|1 |4 |9 |
|1 |3 |12 |
|1 |5 |17 |
|1 |6 |23 |
|1 |4 |4 |
|1 |7 |11 |
|1 |5 |16 |
|1 |2 |18 |
|1 |3 |21 |
|1 |2 |2 |
|1 |6 |8 |
|1 |9 |17 |
+----+-----+------+
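One detail to watch: the two versions above agree on this data, but they treat a running total of exactly 20 differently. The DataFrame expression restarts as soon as `acc < 20` fails (total at or above 20), while the RDD version restarts only when the total is strictly above 20. A small plain-Python sketch, using a made-up sample and a hypothetical helper `running_total`, shows the difference:

```python
def running_total(values, should_reset):
    """Return running totals that restart on the row after should_reset(total) is true."""
    total, out = 0, []
    for v in values:
        total = v if should_reset(total) else total + v
        out.append(total)
    return out

sample = [10, 10, 5]  # made-up values that land exactly on 20

at_or_above = running_total(sample, lambda t: t >= 20)    # mirrors IF(acc < 20, acc + el, el)
strictly_above = running_total(sample, lambda t: t > 20)  # mirrors reset = cumsum > 20

print(at_or_above)     # [10, 20, 5]
print(strictly_above)  # [10, 20, 25]
```

Pick whichever comparison matches what "crosses 20" should mean for your data.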
It's probably best to do this with a pandas_udf here.
import math
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
pdf = pd.DataFrame({'flag':[1]*13,'id':range(13), 'value': [5,4,3,5,6,4,7,5,2,3,2,6,9]})
df = spark.createDataFrame(pdf)
# placeholder double column so that df.schema already contains 'cumsum'
df = df.withColumn('cumsum', F.lit(math.inf))
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def _calc_cumsum(pdf):
    pdf.sort_values(by=['id'], inplace=True, ascending=True)
    cumsums = []
    prev = None
    reset = False
    for v in pdf['value'].values:
        if prev is None:
            cumsums.append(v)
            prev = v
        else:
            prev = prev + v if not reset else v
            cumsums.append(prev)
            reset = True if prev >= 20 else False
    pdf['cumsum'] = cumsums
    return pdf
df = df.groupby('flag').apply(_calc_cumsum)
df.show()
the results:
+----+---+-----+------+
|flag| id|value|cumsum|
+----+---+-----+------+
| 1| 0| 5| 5.0|
| 1| 1| 4| 9.0|
| 1| 2| 3| 12.0|
| 1| 3| 5| 17.0|
| 1| 4| 6| 23.0|
| 1| 5| 4| 4.0|
| 1| 6| 7| 11.0|
| 1| 7| 5| 16.0|
| 1| 8| 2| 18.0|
| 1| 9| 3| 21.0|
| 1| 10| 2| 2.0|
| 1| 11| 6| 8.0|
| 1| 12| 9| 17.0|
+----+---+-----+------+
I have the following DF:
--------------------------------
|Id |Date |Value |cond |
|-------------------------------|
|1 |2022-08-03 | 100| 1 |
|1 |2022-08-04 | 200| 2 |
|1 |2022-08-05 | 150| 3 |
|1 |2022-08-06 | 300| 4 |
|1 |2022-08-07 | 400| 5 |
|1 |2022-08-08 | 150| 6 |
|1 |2022-08-09 | 500| 7 |
|1 |2022-08-10 | 150| 8 |
|1 |2022-08-11 | 150| 9 |
|1 |2022-08-12 | 700| 1 |
|1 |2022-08-13 | 800| 2 |
|1 |2022-08-14 | 150| 2 |
|1 |2022-08-15 | 300| 0 |
|1 |2022-08-16 | 200| 1 |
|1 |2022-08-17 | 150| 3 |
|1 |2022-08-18 | 150| 1 |
|1 |2022-08-19 | 250| 4 |
|1 |2022-08-20 | 150| 5 |
|1 |2022-08-21 | 400| 6 |
|2 |2022-08-03 | 100| 1 |
|2 |2022-08-04 | 200| 2 |
|2 |2022-08-05 | 150| 1 |
|2 |2022-08-06 | 300| 1 |
|2 |2022-08-07 | 400| 1 |
|2 |2022-08-08 | 150| 1 |
|2 |2022-08-09 | 125| 1 |
|2 |2022-08-10 | 150| 1 |
|2 |2022-08-11 | 150| 3 |
|2 |2022-08-12 | 170| 6 |
|2 |2022-08-13 | 150| 7 |
|2 |2022-08-14 | 150| 8 |
|2 |2022-08-15 | 300| 1 |
|2 |2022-08-16 | 150| 9 |
|2 |2022-08-17 | 150| 0 |
|2 |2022-08-18 | 400| 1 |
|2 |2022-08-19 | 150| 1 |
|2 |2022-08-20 | 500| 1 |
|2 |2022-08-21 | 150| 1 |
--------------------------------
And this one:
---------------------
|Date | cond |
|-------------------|
|2022-08-03 | 1 |
|2022-08-04 | 2 |
|2022-08-05 | 1 |
|2022-08-06 | 1 |
|2022-08-07 | 1 |
|2022-08-08 | 1 |
|2022-08-09 | 1 |
|2022-08-10 | 1 |
|2022-08-11 | 3 |
|2022-08-12 | 6 |
|2022-08-13 | 8 |
|2022-08-14 | 9 |
|2022-08-15 | 1 |
|2022-08-16 | 2 |
|2022-08-17 | 2 |
|2022-08-18 | 0 |
|2022-08-19 | 1 |
|2022-08-20 | 3 |
|2022-08-21 | 1 |
--------------------
My expected output is:
-------------------------------
|Id |Date |Avg |Count|
|-----------------------------|
|1 |2022-08-03 | 0| 0 |
|1 |2022-08-04 | 0| 0 |
|1 |2022-08-05 | 0| 0 |
|1 |2022-08-06 | 0| 0 |
|1 |2022-08-07 | 0| 0 |
|1 |2022-08-08 | 0| 0 |
|1 |2022-08-09 | 0| 0 |
|1 |2022-08-10 | 0| 0 |
|1 |2022-08-11 | 0| 0 |
|1 |2022-08-12 | 0| 0 |
|1 |2022-08-13 | 0| 0 |
|1 |2022-08-14 | 0| 0 |
|1 |2022-08-15 | 0| 0 |
|1 |2022-08-16 | 0| 0 |
|1 |2022-08-17 | 0| 0 |
|1 |2022-08-18 | 0| 0 |
|1 |2022-08-19 | 0| 0 |
|1 |2022-08-20 | 0| 0 |
|1 |2022-08-21 | 0| 0 |
|2 |2022-08-03 | 0| 0 |
|2 |2022-08-04 | 0| 0 |
|2 |2022-08-05 | 0| 1 |
|2 |2022-08-06 | 0| 2 |
|2 |2022-08-07 | 0| 3 |
|2 |2022-08-08 | 237,5| 4 |
|2 |2022-08-09 | 250| 4 |
|2 |2022-08-10 |243,75| 4 |
|2 |2022-08-11 | 0| 0 |
|2 |2022-08-12 | 0| 0 |
|2 |2022-08-13 | 0| 0 |
|2 |2022-08-14 | 0| 0 |
|2 |2022-08-15 |206,25| 4 |
|2 |2022-08-16 | 0| 0 |
|2 |2022-08-17 | 0| 0 |
|2 |2022-08-18 | 0| 0 |
|2 |2022-08-19 |243,75| 4 |
|2 |2022-08-20 | 0| 0 |
|2 |2022-08-21 | 337,5| 4 |
-------------------------------
The algorithm is:
Check whether Date and Cond match between the first and second DFs.
If they match, I need to look back in DF1 over the four previous days (D-1, D-2, D-3, D-4) that have the same Cond and calculate the average (Avg) and the count of those values. If more than four earlier days qualify, I use the four most recent values to calculate Avg, and Count is always 4 in that case.
Example situations based on the inputs:
Id = 1, Date = 2022-08-08
Count is 0 because the condition is false, so Avg is 0 too.
Id = 2, Date = 2022-08-08
Count is 4 because the condition is true, so I take the values from 2022-08-07, 2022-08-06, 2022-08-05 and 2022-08-03. I exclude 2022-08-04 because its Cond value is 2, while the Cond of the reference Date is 1.
Id = 2, Date = 2022-08-07
Count is 3 because the condition is true but there are only 3 matching values before that date. Since I need four values to calculate the Avg, Avg is zero in that case.
I tried to use window functions, but without success. I was able to produce the output DF using SQL (joins with OUTER APPLY), but Spark doesn't have OUTER APPLY. So, my questions are:
How can I generate the output DF?
What is the best way to generate the output DF?
MCVE to generate the input DFs in PySpark:
data_1=[
("1","2022-08-03",100,1),
("1","2022-08-04",200,2),
("1","2022-08-05",150,3),
("1","2022-08-06",300,4),
("1","2022-08-07",400,5),
("1","2022-08-08",150,6),
("1","2022-08-09",500,7),
("1","2022-08-10",150,8),
("1","2022-08-11",150,9),
("1","2022-08-12",700,1),
("1","2022-08-13",800,2),
("1","2022-08-14",150,2),
("1","2022-08-15",300,0),
("1","2022-08-16",200,1),
("1","2022-08-17",150,3),
("1","2022-08-18",150,1),
("1","2022-08-19",250,4),
("1","2022-08-20",150,5),
("1","2022-08-21",400,6),
("2","2022-08-03",100,1),
("2","2022-08-04",200,2),
("2","2022-08-05",150,1),
("2","2022-08-06",300,1),
("2","2022-08-07",400,1),
("2","2022-08-08",150,1),
("2","2022-08-09",125,1),
("2","2022-08-10",150,1),
("2","2022-08-11",150,3),
("2","2022-08-12",170,6),
("2","2022-08-13",150,7),
("2","2022-08-14",150,8),
("2","2022-08-15",300,1),
("2","2022-08-16",150,9),
("2","2022-08-17",150,0),
("2","2022-08-18",400,1),
("2","2022-08-19",150,1),
("2","2022-08-20",500,1),
("2","2022-08-21",150,1)
]
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Date is declared as a string because createDataFrame rejects str values for a DateType field;
# cast the column to date (or timestamp) after loading.
schema_1 = StructType([
    StructField("Id", StringType(), True),
    StructField("Date", StringType(), True),
    StructField("Value", IntegerType(), True),
    StructField("Cond", IntegerType(), True)
])
df_1 = spark.createDataFrame(data=data_1, schema=schema_1)
data_2 = [
("2022-08-03", 1),
("2022-08-04", 2),
("2022-08-05", 1),
("2022-08-06", 1),
("2022-08-07", 1),
("2022-08-08", 1),
("2022-08-09", 1),
("2022-08-10", 1),
("2022-08-11", 3),
("2022-08-12", 6),
("2022-08-13", 8),
("2022-08-14", 9),
("2022-08-15", 1),
("2022-08-16", 2),
("2022-08-17", 2),
("2022-08-18", 0),
("2022-08-19", 1),
("2022-08-20", 3),
("2022-08-21", 1)
]
schema_2 = StructType([
    StructField("Date", StringType(), True),
    StructField("Cond", IntegerType(), True)
])
df_2 = spark.createDataFrame(data=data_2, schema=schema_2)
UPDATE: I updated the question to be clearer about the conditions for joining the DFs.
Do a left join to get the dates you are interested in.
Then use a window (pyspark.sql.Window) to collect the values you need into a list, and take the size of that list as Count.
Finally, use pyspark.sql.functions.aggregate (Spark 3.1+) to get the Avg.
from pyspark.sql import functions as F, Window
# cast to date, and rename columns for later use
df_1 = df_1.withColumn("Date", F.col("Date").cast("date"))
df_2 = df_2.withColumn("Date", F.col("Date").cast("date"))
df_2 = df_2.withColumnRenamed("Date", "DateDf2")\
.withColumnRenamed("Cond", "CondDf2")
# left join
df = df_1.join(df_2, (df_1.Cond==df_2.CondDf2)&(df_1.Date==df_2.DateDf2), how='left')
windowSpec = Window.partitionBy("Id", "Cond").orderBy("Date")
# all the magic happens here!
df = (
# only start counting when "DateDf2" is not null, and put the values into a list
df.withColumn("value_list", F.when(F.isnull("DateDf2"), F.array()).otherwise(F.collect_list("Value").over(windowSpec.rowsBetween(-4, -1))))
.withColumn("Count", F.size("value_list"))
# use aggregate to sum up the list only if the size is 4! and divide by 4 to get average
.withColumn("Avg", F.when(F.col("count")==4, F.aggregate("value_list", F.lit(0), lambda acc,x: acc+x)/4).otherwise(F.lit(0)))
.select("Id", "Date", "Avg", "Count")
.orderBy("Id", "Date")
)
Output is:
+---+----------+------+-----+
|Id |Date |Avg |Count|
+---+----------+------+-----+
|1 |2022-08-03|0.0 |0 |
|1 |2022-08-04|0.0 |0 |
|1 |2022-08-05|0.0 |0 |
|1 |2022-08-06|0.0 |0 |
|1 |2022-08-07|0.0 |0 |
|1 |2022-08-08|0.0 |0 |
|1 |2022-08-09|0.0 |0 |
|1 |2022-08-10|0.0 |0 |
|1 |2022-08-11|0.0 |0 |
|1 |2022-08-12|0.0 |0 |
|1 |2022-08-13|0.0 |0 |
|1 |2022-08-14|0.0 |0 |
|1 |2022-08-15|0.0 |0 |
|1 |2022-08-16|0.0 |0 |
|1 |2022-08-17|0.0 |0 |
|1 |2022-08-18|0.0 |0 |
|1 |2022-08-19|0.0 |0 |
|1 |2022-08-20|0.0 |0 |
|1 |2022-08-21|0.0 |0 |
|2 |2022-08-03|0.0 |0 |
|2 |2022-08-04|0.0 |0 |
|2 |2022-08-05|0.0 |1 |
|2 |2022-08-06|0.0 |2 |
|2 |2022-08-07|0.0 |3 |
|2 |2022-08-08|237.5 |4 |
|2 |2022-08-09|250.0 |4 |
|2 |2022-08-10|243.75|4 |
|2 |2022-08-11|0.0 |0 |
|2 |2022-08-12|0.0 |0 |
|2 |2022-08-13|0.0 |0 |
|2 |2022-08-14|0.0 |0 |
|2 |2022-08-15|206.25|4 |
|2 |2022-08-16|0.0 |0 |
|2 |2022-08-17|0.0 |0 |
|2 |2022-08-18|0.0 |0 |
|2 |2022-08-19|243.75|4 |
|2 |2022-08-20|0.0 |0 |
|2 |2022-08-21|337.5 |4 |
+---+----------+------+-----+
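As a cross-check of the rule itself, here is a plain-Python sketch (no Spark) applied to the Id = 2 rows from the question. The helper name `lookback` and the compact tuple layout are assumptions made for illustration, and the rows are assumed to be sorted by date already.

```python
# Id = 2 rows from the question: (day of August 2022, Value, Cond), sorted by date
rows_id2 = [
    (3, 100, 1), (4, 200, 2), (5, 150, 1), (6, 300, 1), (7, 400, 1),
    (8, 150, 1), (9, 125, 1), (10, 150, 1), (11, 150, 3), (12, 170, 6),
    (13, 150, 7), (14, 150, 8), (15, 300, 1), (16, 150, 9), (17, 150, 0),
    (18, 400, 1), (19, 150, 1), (20, 500, 1), (21, 150, 1),
]
# Second DF as a lookup: day of August 2022 -> Cond
cond_by_day = {
    3: 1, 4: 2, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 3, 12: 6,
    13: 8, 14: 9, 15: 1, 16: 2, 17: 2, 18: 0, 19: 1, 20: 3, 21: 1,
}

def lookback(rows, ref, n=4):
    """Map day -> (Avg, Count) using the n most recent earlier rows with the same Cond."""
    result = {}
    for i, (day, _value, cond) in enumerate(rows):
        if ref.get(day) != cond:
            # Date/Cond pair is absent from the second DF: nothing to count.
            result[day] = (0.0, 0)
            continue
        prev = [v for _d, v, c in rows[:i] if c == cond][-n:]
        avg = sum(prev) / n if len(prev) == n else 0.0
        result[day] = (avg, len(prev))
    return result

result = lookback(rows_id2, cond_by_day)
print(result[8])  # (237.5, 4)
```

This mirrors the answer above: the match against the second DF plays the role of the left join, and the slice of earlier same-Cond rows plays the role of rowsBetween(-4, -1) on a window partitioned by Id and Cond.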
Here is another solution:
from pyspark.sql import Window
import pyspark.sql.functions as F
df_1= df_1.withColumn("Date",F.col("Date").cast("timestamp"))
df_2= df_2.withColumn("Date",F.col("Date").cast("timestamp"))
window_spec = Window.partitionBy(["Id"]).orderBy("Date")
four_days_sld_wnd_exl_cuurent_row = Window.partitionBy(["Id"]).orderBy(["rnk"]).rangeBetween(-4, -1)
window_spec_count_cond_ = Window.partitionBy(["Id"]).orderBy(F.unix_timestamp("Date", 'yyyy-MM-dd') / 86400).rangeBetween(-4, -1)
agg_col_cond_ = (F.col("agg") ==0.0)
date_2_col_cond_ = (F.col("Date_2").isNull())
valid_4_days_agg_value =(F.when((~date_2_col_cond_) & (F.size(F.col("date_arrays_with_cond_1"))==4),
F.sum(F.col("Value")).over(four_days_sld_wnd_exl_cuurent_row)).otherwise(F.lit(0.0)))
count_cond_ = (F.when(~agg_col_cond_ & ~date_2_col_cond_,F.lit(4))
.when(agg_col_cond_ & date_2_col_cond_,F.lit(0))
.otherwise(F.size(F.collect_set(F.col("Date_2")).over(window_spec_count_cond_))))
df_jn = df_1.join(df_2,["Date","Cond"],"left")\
.select(df_1["*"],df_2["Date"].alias("Date_2")).orderBy("Id",df_1["Date"])
filter_having_cond_1=(F.col("Cond") == 1)
cond_columns_matching = (F.col("Date_2").isNull())
df_fnl_with_cond_val_1 = df_jn.filter(filter_having_cond_1)
df_fnl_with_cond_val_other=df_jn.filter(~filter_having_cond_1)\
.withColumn("agg",F.lit(0.0))\
.withColumn("count",F.lit(0))\
.drop("Date_2")
df_fnl_with_cond_val_1 = df_fnl_with_cond_val_1\
.withColumn("rnk",F.row_number().over(window_spec))\
.withColumn("date_arrays_with_cond_1", F.collect_set(F.col("Date")).over(four_days_sld_wnd_exl_cuurent_row))\
.withColumn("agg",valid_4_days_agg_value/4)\
.withColumn("count",count_cond_)\
.drop("date_arrays_with_cond_1","rnk","Date_2")
df_fnl = df_fnl_with_cond_val_1.unionByName(df_fnl_with_cond_val_other)
df_fnl.orderBy(["id","Date"]).show(50,0)
Output:
+---+-------------------+-----+----+------+-----+
|Id |Date |Value|Cond|agg |count|
+---+-------------------+-----+----+------+-----+
|1 |2022-08-03 00:00:00|100 |1 |0.0 |0 |
|1 |2022-08-04 00:00:00|200 |2 |0.0 |0 |
|1 |2022-08-05 00:00:00|150 |3 |0.0 |0 |
|1 |2022-08-06 00:00:00|300 |4 |0.0 |0 |
|1 |2022-08-07 00:00:00|400 |5 |0.0 |0 |
|1 |2022-08-08 00:00:00|150 |6 |0.0 |0 |
|1 |2022-08-09 00:00:00|500 |7 |0.0 |0 |
|1 |2022-08-10 00:00:00|150 |8 |0.0 |0 |
|1 |2022-08-11 00:00:00|150 |9 |0.0 |0 |
|1 |2022-08-12 00:00:00|700 |1 |0.0 |0 |
|1 |2022-08-13 00:00:00|800 |2 |0.0 |0 |
|1 |2022-08-14 00:00:00|150 |2 |0.0 |0 |
|1 |2022-08-15 00:00:00|300 |0 |0.0 |0 |
|1 |2022-08-16 00:00:00|200 |1 |0.0 |0 |
|1 |2022-08-17 00:00:00|150 |3 |0.0 |0 |
|1 |2022-08-18 00:00:00|150 |1 |0.0 |0 |
|1 |2022-08-19 00:00:00|250 |4 |0.0 |0 |
|1 |2022-08-20 00:00:00|150 |5 |0.0 |0 |
|1 |2022-08-21 00:00:00|400 |6 |0.0 |0 |
|2 |2022-08-03 00:00:00|100 |1 |0.0 |0 |
|2 |2022-08-04 00:00:00|200 |2 |0.0 |0 |
|2 |2022-08-05 00:00:00|150 |1 |0.0 |1 |
|2 |2022-08-06 00:00:00|300 |1 |0.0 |2 |
|2 |2022-08-07 00:00:00|400 |1 |0.0 |3 |
|2 |2022-08-08 00:00:00|150 |1 |237.5 |4 |
|2 |2022-08-09 00:00:00|125 |1 |250.0 |4 |
|2 |2022-08-10 00:00:00|150 |1 |243.75|4 |
|2 |2022-08-11 00:00:00|150 |3 |0.0 |0 |
|2 |2022-08-12 00:00:00|170 |6 |0.0 |0 |
|2 |2022-08-13 00:00:00|150 |7 |0.0 |0 |
|2 |2022-08-14 00:00:00|150 |8 |0.0 |0 |
|2 |2022-08-15 00:00:00|300 |1 |206.25|4 |
|2 |2022-08-16 00:00:00|150 |9 |0.0 |0 |
|2 |2022-08-17 00:00:00|150 |0 |0.0 |0 |
|2 |2022-08-18 00:00:00|400 |1 |0.0 |0 |
|2 |2022-08-19 00:00:00|150 |1 |243.75|4 |
|2 |2022-08-20 00:00:00|500 |1 |0.0 |0 |
|2 |2022-08-21 00:00:00|150 |1 |337.5 |4 |
+---+-------------------+-----+----+------+-----+
Here is my test data
test = spark.createDataFrame([
("2018-06-03",2, 4, 4 ),
("2018-06-04",4, 3, 3 ),
( "2018-06-03",8, 1, 1),
("2018-06-01",3, 1, 1),
( "2018-06-05", 3, 2, 0),
])\
.toDF( "transactiondate", "SalesA", "SalesB","SalesC")
test.show()
I would like to add a row-wise total column and a percentage-of-total column for each sales category (A, B and C).
Desired Output:
+---------------+------+------+------+----------+------+------+------+
|transactiondate|SalesA|SalesB|SalesC|TotalSales|Perc_A|Perc_B|Perc_C|
+---------------+------+------+------+----------+------+------+------+
| 2018-06-03| 2| 4| 4| 10| 0.2| 0.4| 0.4|
| 2018-06-04| 4| 3| 3| 10| 0.4| 0.3| 0.3|
| 2018-06-03| 8| 1| 1| 10| 0.8| 0.1| 0.1|
| 2018-06-01| 3| 1| 1| 5| 0.6| 0.2| 0.2|
| 2018-06-05| 3| 2| 0| 5| 0.6| 0.4| 0.0|
+---------------+------+------+------+----------+------+------+------+
How can I do it in pyspark?
Edit: I want the code to stay adaptable if I add more items, i.e. if I add one more column SalesD, the code should still create the total and percentage columns (the column names shouldn't be hardcoded).
You can use selectExpr and write a simple arithmetic SQL expression for each added column:
test = test.selectExpr("*",
"SalesA+SalesB+SalesC as TotalSales",
"SalesA/(SalesA+SalesB+SalesC) as Perc_A",
"SalesB/(SalesA+SalesB+SalesC) as Perc_B",
"SalesC/(SalesA+SalesB+SalesC) as Perc_C"
)
Or use a more flexible solution:
from pyspark.sql.functions import col, expr
# columns to be included in TotalSales calculation
cols = ['SalesA', 'SalesB', 'SalesC']
test = (test
.withColumn('TotalSales', expr('+'.join(cols)))
.select(col('*'),
*[expr('{0}/TotalSales {1}'.format(c,'Perc_'+c)) for c in cols]))
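If you want the Perc_A style names from the desired output and automatic pickup of new Sales* columns, the expression strings can be generated from the column list. The sketch below is plain Python and only builds strings; the helper name `sales_exprs` and the prefix-stripping rule are assumptions made for illustration.

```python
def sales_exprs(columns, prefix="Sales"):
    """Build SQL expression strings for the total and per-column share."""
    sales = [c for c in columns if c.startswith(prefix)]
    total = " + ".join(sales)
    exprs = [f"{total} AS TotalSales"]
    # Strip the prefix so that SalesA becomes Perc_A, SalesB becomes Perc_B, and so on.
    exprs += [f"{c} / ({total}) AS Perc_{c[len(prefix):]}" for c in sales]
    return exprs

exprs = sales_exprs(["transactiondate", "SalesA", "SalesB", "SalesC", "SalesD"])
for e in exprs:
    print(e)
```

The resulting list could then be passed on as test.selectExpr("*", *sales_exprs(test.columns)), so adding a SalesD column needs no code change.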
One option is to use several withColumn statements
import pyspark.sql.functions as F
test\
.withColumn('TotalSales', F.col('SalesA') + F.col('SalesB') + F.col('SalesC'))\
.withColumn('Perc_A', F.col('SalesA') / F.col('TotalSales'))\
.withColumn('Perc_B', F.col('SalesB') / F.col('TotalSales'))\
.withColumn('Perc_C', F.col('SalesC') / F.col('TotalSales'))
Try this Spark SQL solution:
test.createOrReplaceTempView("sales_table")
sales=[ x for x in test.columns if x.upper().startswith("SALES") ]
sales2="+".join(sales)
print(str(sales)) # ['SalesA', 'SalesB', 'SalesC']
per_sales=[ x +"/TotalSales as " + "Perc_" +x for x in sales ]
per_sales2=",".join(per_sales)
print(str(per_sales)) # ['SalesA/TotalSales as Perc_SalesA', 'SalesB/TotalSales as Perc_SalesB', 'SalesC/TotalSales as Perc_SalesC']
spark.sql(f"""
with t1 ( select *, {sales2} TotalSales from sales_table )
select *, {per_sales2} from t1
""").show(truncate=False)
+---------------+------+------+------+----------+-----------+-----------+-----------+
|transactiondate|SalesA|SalesB|SalesC|TotalSales|Perc_SalesA|Perc_SalesB|Perc_SalesC|
+---------------+------+------+------+----------+-----------+-----------+-----------+
|2018-06-03 |2 |4 |4 |10 |0.2 |0.4 |0.4 |
|2018-06-04 |4 |3 |3 |10 |0.4 |0.3 |0.3 |
|2018-06-03 |8 |1 |1 |10 |0.8 |0.1 |0.1 |
|2018-06-01 |3 |1 |1 |5 |0.6 |0.2 |0.2 |
|2018-06-05 |3 |2 |0 |5 |0.6 |0.4 |0.0 |
+---------------+------+------+------+----------+-----------+-----------+-----------+
You can also use the aggregate() higher-order function to sum the Sales* columns. For this, the columns must be of int/double type, not long.
test2=test.withColumn("SalesA",expr("cast(salesa as int)"))\
.withColumn("SalesB",expr("cast(salesb as int)"))\
.withColumn("SalesC",expr("cast(salesc as int)"))
test2.createOrReplaceTempView("sales_table2")
sales3=",".join(sales) # just join the sales columns with comma
spark.sql(f"""
with t1 ( select *, aggregate(array({sales3}),0,(acc,x) -> acc+x) TotalSales from sales_table2 )
select *, {per_sales2} from t1
""").show(truncate=False)
+---------------+------+------+------+----------+-----------+-----------+-----------+
|transactiondate|SalesA|SalesB|SalesC|TotalSales|Perc_SalesA|Perc_SalesB|Perc_SalesC|
+---------------+------+------+------+----------+-----------+-----------+-----------+
|2018-06-03 |2 |4 |4 |10 |0.2 |0.4 |0.4 |
|2018-06-04 |4 |3 |3 |10 |0.4 |0.3 |0.3 |
|2018-06-03 |8 |1 |1 |10 |0.8 |0.1 |0.1 |
|2018-06-01 |3 |1 |1 |5 |0.6 |0.2 |0.2 |
|2018-06-05 |3 |2 |0 |5 |0.6 |0.4 |0.0 |
+---------------+------+------+------+----------+-----------+-----------+-----------+
Current DF (filtered by a single userId; flag is 1 when the loss is > 0, -1 when it is <= 0):
display(df):
+------+----------+---------+----+
| user|Date |RealLoss |flag|
+------+----------+---------+----+
|100364|2019-02-01| -16.5| 1|
|100364|2019-02-02| 73.5| -1|
|100364|2019-02-03| 31| -1|
|100364|2019-02-09| -5.2| 1|
|100364|2019-02-10| -34.5| 1|
|100364|2019-02-13| -8.1| 1|
|100364|2019-02-18| 5.68| -1|
|100364|2019-02-19| 5.76| -1|
|100364|2019-02-20| 9.12| -1|
|100364|2019-02-26| 9.4| -1|
|100364|2019-02-27| -30.6| 1|
+------+----------+---------+----+
The desired output DF should show the number of days since the last win ('RecencyLastWin') and since the last loss ('RecencyLastLoss'):
display(df):
+------+----------+---------+----+--------------+---------------+
| user|Date |RealLoss |flag|RecencyLastWin|RecencyLastLoss|
+------+----------+---------+----+--------------+---------------+
|100364|2019-02-01| -16.5| 1| null| null|
|100364|2019-02-02| 73.5| -1| 1| null|
|100364|2019-02-03| 31| -1| 2| 1|
|100364|2019-02-09| -5.2| 1| 8| 6|
|100364|2019-02-10| -34.5| 1| 1| 7|
|100364|2019-02-13| -8.1| 1| 1| 10|
|100364|2019-02-18| 5.68| -1| 5| 15|
|100364|2019-02-19| 5.76| -1| 6| 1|
|100364|2019-02-20| 9.12| -1| 7| 1|
|100364|2019-02-26| 9.4| -1| 13| 6|
|100364|2019-02-27| -30.6| 1| 14| 1|
+------+----------+---------+----+--------------+---------------+
My approach was the following:
from pyspark.sql.window import Window
w = Window.partitionBy("userId", 'PlayerSiteCode').orderBy("EventDate")
last_positive = check.filter('flag = "1"').withColumn('last_positive_day' , F.lag('EventDate').over(w))
last_negative = check.filter('flag = "-1"').withColumn('last_negative_day' , F.lag('EventDate').over(w))
finalcheck = check.join(last_positive.select('userId', 'PlayerSiteCode', 'EventDate', 'last_positive_day'), ['userId', 'PlayerSiteCode', 'EventDate'], how = 'left')\
.join(last_negative.select('userId', 'PlayerSiteCode', 'EventDate', 'last_negative_day'), ['userId', 'PlayerSiteCode', 'EventDate'], how = 'left')\
.withColumn('previous_date_played' , F.lag('EventDate').over(w))\
.withColumn('last_positive_day_count', F.datediff(F.col('EventDate'), F.col('last_positive_day')))\
.withColumn('last_negative_day_count', F.datediff(F.col('EventDate'), F.col('last_negative_day')))
Then I tried to add the following (after multiple attempts), but it fails to return exactly what I want:
finalcheck = finalcheck.withColumn('previous_last_pos' , F.last('last_positive_day_count', True).over(w2))\
.withColumn('previous_last_neg' , F.last('last_negative_day_count', True).over(w2))\
.withColumn('previous_last_pos_date' , F.last('last_positive_day', True).over(w2))\
.withColumn('previous_last_neg_date' , F.last('last_negative_day', True).over(w2))\
.withColumn('recency_last_positive' , F.datediff(F.col('EventDate'), F.col('previous_last_pos_date')))\
.withColumn('day_since_last_negative_v1' , F.datediff(F.col('EventDate'), F.col('previous_last_neg_date')))\
.withColumn('days_off' , F.datediff(F.col('EventDate'), F.col('previous_date_played')))\
.withColumn('recency_last_negative' , F.when((F.col('day_since_last_negative_v1').isNull()), F.col('days_off')).otherwise(F.col('day_since_last_negative_v1')))\
.withColumn('recency_last_negative_v2' , F.when((F.col('last_negative_day').isNull()), F.col('days_off')).otherwise(F.col('day_since_last_negative_v1')))\
.withColumn('recency_last_positive_v2' , F.when((F.col('last_positive_day').isNull()), F.col('days_off')).otherwise(F.col('recency_last_positive')))
Any suggestions or tips?
I found a similar question but couldn't figure out how to apply it to my specific case:
How to calculate days between when last condition was met?
Here is my attempt.
There are two parts to this calculation. The first covers runs of consecutive wins or losses, where the date differences have to be summed. To do this, I mark consecutive losses and wins with 1 and split the rows into partition groups by taking the cumulative sum of that marker up to the current row. Within each group I can then accumulate the days since the last loss or win.
The second covers rows where the result changes from win to loss or vice versa: the value is simply the date difference between the previous match and the current one.
Finally, the two results are merged into one column.
from pyspark.sql.functions import coalesce, col, expr, lag, sum, when
from pyspark.sql import Window
w1 = Window.orderBy('Date')
w2 = Window.partitionBy('groupLossCheck').orderBy('Date')
w3 = Window.partitionBy('groupWinCheck').orderBy('Date')
df2 = df.withColumn('lastFlag', lag('flag', 1).over(w1)) \
.withColumn('lastDate', lag('Date', 1).over(w1)) \
.withColumn('dateDiff', expr('datediff(Date, lastDate)')) \
.withColumn('consecutiveLoss', expr('if(flag = 1 or lastFlag = 1, 0, 1)')) \
.withColumn('consecutiveWin' , expr('if(flag = -1 or lastFlag = -1, 0, 1)')) \
.withColumn('groupLossCheck', sum('consecutiveLoss').over(w1)) \
.withColumn('groupWinCheck' , sum('consecutiveWin' ).over(w1)) \
.withColumn('daysLastLoss', sum(when((col('consecutiveLoss') == 0) & (col('groupLossCheck') != 0), col('dateDiff'))).over(w2)) \
.withColumn('daysLastwin' , sum(when((col('consecutiveWin' ) == 0) & (col('groupWinCheck' ) != 0), col('dateDiff'))).over(w3)) \
.withColumn('lastLoss', expr('if(lastFlag = -1, dateDiff, null)')) \
.withColumn('lastWin' , expr('if(lastFlag = 1, dateDiff, null)')) \
.withColumn('RecencyLastLoss', coalesce('lastLoss', 'daysLastLoss')) \
.withColumn('RecencyLastWin', coalesce('lastWin' , 'daysLastwin' )) \
.orderBy('Date')
df2.show(11, False)
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
|user |Date |RealLoss|flag|lastFlag|lastDate |dateDiff|consecutiveLoss|consecutiveWin|groupLossCheck|groupWinCheck|daysLastLoss|daysLastwin|lastLoss|lastWin|RecencyLastLoss|RecencyLastWin|
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
|100364|2019-02-01|-16.5 |1 |null |null |null |0 |1 |0 |1 |null |null |null |null |null |null |
|100364|2019-02-02|73.5 |-1 |1 |2019-02-01|1 |0 |0 |0 |1 |null |1 |null |1 |null |1 |
|100364|2019-02-03|31.0 |-1 |-1 |2019-02-02|1 |1 |0 |1 |1 |null |2 |1 |null |1 |2 |
|100364|2019-02-09|-5.2 |1 |-1 |2019-02-03|6 |0 |0 |1 |1 |6 |8 |6 |null |6 |8 |
|100364|2019-02-10|-34.5 |1 |1 |2019-02-09|1 |0 |1 |1 |2 |7 |null |null |1 |7 |1 |
|100364|2019-02-13|-8.1 |1 |1 |2019-02-10|3 |0 |1 |1 |3 |10 |null |null |3 |10 |3 |
|100364|2019-02-18|5.68 |-1 |1 |2019-02-13|5 |0 |0 |1 |3 |15 |5 |null |5 |15 |5 |
|100364|2019-02-19|5.76 |-1 |-1 |2019-02-18|1 |1 |0 |2 |3 |null |6 |1 |null |1 |6 |
|100364|2019-02-20|9.12 |-1 |-1 |2019-02-19|1 |1 |0 |3 |3 |null |7 |1 |null |1 |7 |
|100364|2019-02-26|9.4 |-1 |-1 |2019-02-20|6 |1 |0 |4 |3 |null |13 |6 |null |6 |13 |
|100364|2019-02-27|-30.6 |1 |-1 |2019-02-26|1 |0 |0 |4 |3 |1 |14 |1 |null |1 |14 |
+------+----------+--------+----+--------+----------+--------+---------------+--------------+--------------+-------------+------------+-----------+--------+-------+---------------+--------------+
df2.select(*df.columns, 'RecencyLastLoss', 'RecencyLastWin').show(11, False)
+------+----------+--------+----+---------------+--------------+
|user |Date |RealLoss|flag|RecencyLastLoss|RecencyLastWin|
+------+----------+--------+----+---------------+--------------+
|100364|2019-02-01|-16.5 |1 |null |null |
|100364|2019-02-02|73.5 |-1 |null |1 |
|100364|2019-02-03|31.0 |-1 |1 |2 |
|100364|2019-02-09|-5.2 |1 |6 |8 |
|100364|2019-02-10|-34.5 |1 |7 |1 |
|100364|2019-02-13|-8.1 |1 |10 |3 |
|100364|2019-02-18|5.68 |-1 |15 |5 |
|100364|2019-02-19|5.76 |-1 |1 |6 |
|100364|2019-02-20|9.12 |-1 |1 |7 |
|100364|2019-02-26|9.4 |-1 |6 |13 |
|100364|2019-02-27|-30.6 |1 |1 |14 |
+------+----------+--------+----+---------------+--------------+
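Reading the result table, each recency column is simply the number of days since the most recent earlier row with flag 1 (win) or flag -1 (loss). A plain-Python sketch of that reading, with a hypothetical helper `recency` and no Spark involved, reproduces the two columns above:

```python
from datetime import date

# (Date, flag) pairs from the question, already sorted by date
rows = [
    (date(2019, 2, 1), 1), (date(2019, 2, 2), -1), (date(2019, 2, 3), -1),
    (date(2019, 2, 9), 1), (date(2019, 2, 10), 1), (date(2019, 2, 13), 1),
    (date(2019, 2, 18), -1), (date(2019, 2, 19), -1), (date(2019, 2, 20), -1),
    (date(2019, 2, 26), -1), (date(2019, 2, 27), 1),
]

def recency(rows):
    """Return (RecencyLastLoss, RecencyLastWin) per row; None when no earlier match exists."""
    last_seen = {1: None, -1: None}  # most recent earlier date per flag value
    out = []
    for day, flag in rows:
        since_win = (day - last_seen[1]).days if last_seen[1] else None
        since_loss = (day - last_seen[-1]).days if last_seen[-1] else None
        out.append((since_loss, since_win))
        last_seen[flag] = day  # update only after the current row has been evaluated
    return out

result = recency(rows)
last_loss = [r[0] for r in result]
last_win = [r[1] for r in result]
print(last_loss)
print(last_win)
```

In Spark, the same "remember the last date per flag" idea might be expressible with last(..., ignorenulls=True) over a window that ends at the previous row; that variant is not tested here.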
I am creating a DataFrame from a given schema, and then I want to create a new DataFrame by reordering the columns of the existing one.
Is it possible to reorder the columns of a Spark DataFrame?
object Demo extends Context {
  def main(args: Array[String]): Unit = {
    val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
      (2,"Rose",1,"2010","20","M",4000),
      (3,"Williams",1,"2010","10","M",1000),
      (4,"Jones",2,"2005","10","F",2000),
      (5,"Brown",2,"2010","40","",-1),
      (6,"Brown",2,"2010","50","",-1)
    )
    val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
      "emp_dept_id","gender","salary")
    import sparkSession.sqlContext.implicits._
    val empDF = emp.toDF(empColumns: _*)
    empDF.show(false)
  }
}
Current DF:
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
I want the output to be the following DF, where the gender and salary columns are reordered:
New DF:
+------+--------+------+------+---------------+-----------+-----------+
|emp_id|name |gender|salary|superior_emp_id|year_joined|emp_dept_id|
+------+--------+------+------+---------------+-----------+-----------+
|1 |Smith |M |3000 |-1 |2018 |10 |
|2 |Rose |M |4000 |1 |2010 |20 |
|3 |Williams|M |1000 |1 |2010 |10 |
|4 |Jones |F |2000 |2 |2005 |10 |
|5 |Brown | |-1 |2 |2010 |40 |
|6 |Brown | |-1 |2 |2010 |50 |
+------+--------+------+------+---------------+-----------+-----------+
Just use select() to re-order the columns:
df = df.select('emp_id','name','gender','salary','superior_emp_id','year_joined','emp_dept_id')
The columns will appear in the order in which you pass them to select().
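If you would rather not type the full column list, the new order can be computed from the existing one. This is a plain-Python sketch that only manipulates the list of names; `move_after` is a hypothetical helper, not a Spark function.

```python
def move_after(columns, to_move, anchor):
    """Return columns with the to_move names placed right after the anchor column."""
    rest = [c for c in columns if c not in to_move]
    pos = rest.index(anchor) + 1
    return rest[:pos] + list(to_move) + rest[pos:]

columns = ["emp_id", "name", "superior_emp_id", "year_joined",
           "emp_dept_id", "gender", "salary"]
new_order = move_after(columns, ["gender", "salary"], "name")
print(new_order)
```

In PySpark the result could be used as df.select(*move_after(df.columns, ["gender", "salary"], "name")).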
The Scala way of doing it:
import org.apache.spark.sql.functions.col
// Order the column names as you want
val columns = Array("emp_id","name","gender","salary","superior_emp_id","year_joined","emp_dept_id")
  .map(col)
// Pass it to select
df.select(columns: _*)
I have a pyspark.rdd.PipelinedRDD (Rdd1).
When I call Rdd1.collect(), it gives a result like the one below:
[(10, {3: 3.616726727464709, 4: 2.9996439803387602, 5: 1.6767412921625855}),
(1, {3: 2.016527311459324, 4: -1.5271512313750577, 5: 1.9665475696370045}),
(2, {3: 6.230272144805092, 4: 4.033642544526678, 5: 3.1517805604906313}),
(3, {3: -0.3924680103722977, 4: 2.9757316477407443, 5: -1.5689126834176417})]
Now I want to convert the pyspark.rdd.PipelinedRDD to a DataFrame without using the collect() method.
My final DataFrame should look like this when I call df.show():
+----------+-------+-------------------+
|CId |IID |Score |
+----------+-------+-------------------+
|10 |4 |2.9996439803387602 |
|10 |5 |1.6767412921625855 |
|10 |3 |3.616726727464709 |
|1 |4 |-1.5271512313750577|
|1 |5 |1.9665475696370045 |
|1 |3 |2.016527311459324 |
|2 |4 |4.033642544526678 |
|2 |5 |3.1517805604906313 |
|2 |3 |6.230272144805092 |
|3 |4 |2.9757316477407443 |
|3 |5 |-1.5689126834176417|
|3 |3 |-0.3924680103722977|
+----------+-------+-------------------+
I can achieve this by collecting the RDD, iterating over the result, and finally building a DataFrame,
but now I want to convert the pyspark.rdd.PipelinedRDD to a DataFrame without using any collect() call.
Please let me know how to achieve this.
You want to do two things here:
1. flatten your data
2. put it into a dataframe
One way to do it is as follows:
First, let us flatten the dictionary:
rdd2 = Rdd1.flatMapValues(lambda x : [ (k, x[k]) for k in x.keys()])
When collecting the data, you get something like this:
[(10, (3, 3.616726727464709)), (10, (4, 2.9996439803387602)), ...
Then we can format the data and turn it into a dataframe:
rdd2.map(lambda x : (x[0], x[1][0], x[1][1]))\
.toDF(("CId", "IID", "Score"))\
.show()
which gives you this:
+---+---+-------------------+
|CId|IID| Score|
+---+---+-------------------+
| 10| 3| 3.616726727464709|
| 10| 4| 2.9996439803387602|
| 10| 5| 1.6767412921625855|
| 1| 3| 2.016527311459324|
| 1| 4|-1.5271512313750577|
| 1| 5| 1.9665475696370045|
| 2| 3| 6.230272144805092|
| 2| 4| 4.033642544526678|
| 2| 5| 3.1517805604906313|
| 3| 3|-0.3924680103722977|
| 3| 4| 2.9757316477407443|
| 3| 5|-1.5689126834176417|
+---+---+-------------------+
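If it is unclear what flatMapValues does here, the following plain-Python sketch imitates it on an ordinary list. `flat_map_values` is a made-up stand-in for the RDD method, and the scores are shortened for readability.

```python
def flat_map_values(pairs, f):
    """For every (key, value) pair, emit one (key, item) pair per item in f(value)."""
    return [(key, item) for key, value in pairs for item in f(value)]

data = [(10, {3: 3.6, 4: 3.0}), (1, {3: 2.0})]  # shortened sample, not the real scores

# Step 1: flatten each dictionary into (IID, Score) tuples, keeping the key.
flat = flat_map_values(data, lambda d: list(d.items()))
# Step 2: unpack the nested tuple into a flat three-field row.
rows = [(cid, iid, score) for cid, (iid, score) in flat]

print(flat)  # [(10, (3, 3.6)), (10, (4, 3.0)), (1, (3, 2.0))]
print(rows)  # [(10, 3, 3.6), (10, 4, 3.0), (1, 3, 2.0)]
```

The lambda uses d.items(), which yields the same pairs as the comprehension over x.keys() in the answer.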
There is an even simpler and more elegant solution that avoids the Python lambda expressions used in @oli's answer: Spark DataFrame's explode, which fits your requirement well. It should be faster too, because there is no need to run Python lambdas twice. See below:
from pyspark.sql.functions import explode
# dummy data
data = [(10, {3: 3.616726727464709, 4: 2.9996439803387602, 5: 1.6767412921625855}),
(1, {3: 2.016527311459324, 4: -1.5271512313750577, 5: 1.9665475696370045}),
(2, {3: 6.230272144805092, 4: 4.033642544526678, 5: 3.1517805604906313}),
(3, {3: -0.3924680103722977, 4: 2.9757316477407443, 5: -1.5689126834176417})]
# create your rdd
rdd = sc.parallelize(data)
# convert to spark data frame
df = rdd.toDF(["CId", "Values"])
# use explode
df.select("CId", explode("Values").alias("IID", "Score")).show()
+---+---+-------------------+
|CId|IID| Score|
+---+---+-------------------+
| 10| 3| 3.616726727464709|
| 10| 4| 2.9996439803387602|
| 10| 5| 1.6767412921625855|
| 1| 3| 2.016527311459324|
| 1| 4|-1.5271512313750577|
| 1| 5| 1.9665475696370045|
| 2| 3| 6.230272144805092|
| 2| 4| 4.033642544526678|
| 2| 5| 3.1517805604906313|
| 3| 3|-0.3924680103722977|
| 3| 4| 2.9757316477407443|
| 3| 5|-1.5689126834176417|
+---+---+-------------------+
This is how you can do it with scala
val Rdd1 = spark.sparkContext.parallelize(Seq(
(10, Map(3 -> 3.616726727464709, 4 -> 2.9996439803387602, 5 -> 1.6767412921625855)),
(1, Map(3 -> 2.016527311459324, 4 -> -1.5271512313750577, 5 -> 1.9665475696370045)),
(2, Map(3 -> 6.230272144805092, 4 -> 4.033642544526678, 5 -> 3.1517805604906313)),
(3, Map(3 -> -0.3924680103722977, 4 -> 2.9757316477407443, 5 -> -1.5689126834176417))
))
val x = Rdd1.flatMap(x => (x._2.map(y => (x._1, y._1, y._2))))
.toDF("CId", "IId", "score")
Output:
+---+---+-------------------+
|CId|IId|score |
+---+---+-------------------+
|10 |3 |3.616726727464709 |
|10 |4 |2.9996439803387602 |
|10 |5 |1.6767412921625855 |
|1 |3 |2.016527311459324 |
|1 |4 |-1.5271512313750577|
|1 |5 |1.9665475696370045 |
|2 |3 |6.230272144805092 |
|2 |4 |4.033642544526678 |
|2 |5 |3.1517805604906313 |
|3 |3 |-0.3924680103722977|
|3 |4 |2.9757316477407443 |
|3 |5 |-1.5689126834176417|
+---+---+-------------------+
Hope you can convert to pyspark.
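Make sure a SparkSession has been created first, since toDF is only attached to RDDs once a SparkSession exists:
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext()
spark = SparkSession(sc)
I found this answer while trying to solve this exact issue: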
'PipelinedRDD' object has no attribute 'toDF' in PySpark