I have a data frame
| Id | Date | Value |
| 1 | 1/1/2019 | 11 |
| 1 | 1/2/2019 | 12 |
| 1 | 1/3/2019 | 13 |
| 1 | 1/5/2019 | 14 |
| 1 | 1/6/2019 | 15 |
I want to calculate the sum of the last 2 values by date:
| Id | Date | Value | Sum |
| 1 | 1/1/2019 | 11 | null |
| 1 | 1/2/2019 | 12 | null |
| 1 | 1/3/2019 | 13 | 23 |
| 1 | 1/5/2019 | 14 | -13 | // there is no 1/4 so 0 - 13
| 1 | 1/6/2019 | 15 | 14 | // there is no 1/4 so 14 - 0
Right now I have
let window = Window
.PartitionBy("Id")
.OrderBy(Functions.Col("Date").Cast("timestamp").Cast("long"))
data.WithColumn("Sum", Functions.Lag("Value", 1).Over(window) - Functions.Lag("Value", 2).Over(window))
With this approach, the missing value is effectively treated as equal to the previous one (so 1/4 is taken to be 1/3 = 13).
How can I treat 1/4 as zero instead?
You have two ways to do this.
One would be to use the lag function together with when and otherwise, and use the date API to subtract one day from the date.
The pro is that this works fine and is quick; the con is that every time you want to change your lag formula, you have to rewrite it...
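For illustration, a rough sketch of that first approach in PySpark could look like the code below. This is my own sketch, not code from the question (which uses the .NET API); it assumes the Date column has been parsed to a real date, and it yields 0 rather than null for the first rows.
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Sample data from the question
df = spark.createDataFrame(
    [(1, "1/1/2019", 11), (1, "1/2/2019", 12), (1, "1/3/2019", 13),
     (1, "1/5/2019", 14), (1, "1/6/2019", 15)],
    ["Id", "Date", "Value"]
).withColumn("Date", F.to_date("Date", "M/d/yyyy"))

w = Window.partitionBy("Id").orderBy("Date")

# Value of exactly one day before; 0 if that day is missing
prev1 = F.when(
    F.datediff("Date", F.lag("Date", 1).over(w)) == 1, F.lag("Value", 1).over(w)
).otherwise(F.lit(0))

# Value of exactly two days before; depending on gaps it sits at lag(1) or lag(2)
prev2 = (
    F.when(F.datediff("Date", F.lag("Date", 1).over(w)) == 2, F.lag("Value", 1).over(w))
     .when(F.datediff("Date", F.lag("Date", 2).over(w)) == 2, F.lag("Value", 2).over(w))
     .otherwise(F.lit(0))
)

df.withColumn("Sum", prev1 - prev2).show()
You can see why this becomes unwieldy: every change to the lag formula means re-deriving the when conditions by hand.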
However, there is a more generalizable method. The idea is to fill in the missing dates by casting the timestamp to a long and using spark.range to generate every possible date between minDate and maxDate.
// Some imports (spark.implicits._ is needed for the $ column syntax, toDF and the Dataset encoder below)
import org.apache.spark.sql.{functions => F}
import org.apache.spark.sql.expressions.Window
import spark.implicits._
// Our DF
val df = Seq(
(1, "1/1/2019", 11),
(1, "1/2/2019", 12),
(1, "1/3/2019", 13),
(1, "1/5/2019", 14),
(1, "1/6/2019", 15)
).
toDF("id", "date", "value").
withColumn("date", F.to_timestamp($"date", "MM/dd/yyyy"))
// min and max date as epoch seconds
val (mindate, maxdate) = df.select(F.min($"date").cast("long"), F.max($"date").cast("long")).as[(Long, Long)].first
// Our step in seconds, so one day here
val step: Long = 24 * 60 * 60
// Generate missing dates
val reference = spark.
range(mindate, ((maxdate / step) + 1) * step, step).
select($"id".cast("timestamp").as("date"))
// Our df filled !
val filledDf = reference.join(df, Seq("date"), "leftouter").na.fill(0, Seq("value"))
/**
+-------------------+----+-----+
| date| id|value|
+-------------------+----+-----+
|2019-01-01 00:00:00| 1| 11|
|2019-01-02 00:00:00| 1| 12|
|2019-01-03 00:00:00| 1| 13|
|2019-01-04 00:00:00|null| 0|
|2019-01-05 00:00:00| 1| 14|
|2019-01-06 00:00:00| 1| 15|
+-------------------+----+-----+
*/
// Window ordered by date (the generated rows have a null id, so we do not partition by it)
val windowSpec = Window.orderBy($"date")
filledDf.
withColumn("result", F.lag($"value", 1, 0).over(windowSpec) - F.lag($"value", 2, 0).over(windowSpec)).show
/**
+-------------------+----+-----+------+
| date| id|value|result|
+-------------------+----+-----+------+
|2019-01-01 00:00:00| 1| 11| 0|
|2019-01-02 00:00:00| 1| 12| 11|
|2019-01-03 00:00:00| 1| 13| 1|
|2019-01-04 00:00:00|null| 0| 1|
|2019-01-05 00:00:00| 1| 14| -13|
|2019-01-06 00:00:00| 1| 15| 14|
+-------------------+----+-----+------+
*/
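For completeness, since the other questions and answers on this page are in PySpark, roughly the same fill-the-gaps idea could be sketched in PySpark as follows. Treat it as a sketch under the same assumptions as the Scala version above, not as a drop-in translation of the asker's .NET code.
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "1/1/2019", 11), (1, "1/2/2019", 12), (1, "1/3/2019", 13),
     (1, "1/5/2019", 14), (1, "1/6/2019", 15)],
    ["id", "date", "value"]
).withColumn("date", F.to_timestamp("date", "M/d/yyyy"))

step = 24 * 60 * 60  # one day in seconds
min_ts, max_ts = df.select(
    F.min("date").cast("long"), F.max("date").cast("long")
).first()

# One row per day between the min and max date
reference = spark.range(min_ts, max_ts + step, step) \
    .select(F.col("id").cast("timestamp").alias("date"))

# Fill the gaps: missing days get value 0
filled = reference.join(df, ["date"], "left").na.fill(0, ["value"])

w = Window.orderBy("date")
filled.withColumn(
    "result", F.lag("value", 1, 0).over(w) - F.lag("value", 2, 0).over(w)
).show()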
I want to convert multiple numeric columns of a PySpark dataframe into their percentile values using PySpark, without changing the order of the rows.
E.g. given an array of column names arr = [Salary, Age, Bonus], convert those columns into percentiles.
Input
+----------+-------------+---------+--------+-----+-------+
| Empl. No | Dept | Pincode | Salary | Age | Bonus |
+----------+-------------+---------+--------+-----+-------+
| 1 | HR | 111 | 1000 | 45 | 100 |
| 2 | Sales | 596 | 500 | 30 | 50 |
| 3 | Manufacture | 895 | 600 | 50 | 400 |
| 4 | HR | 212 | 700 | 26 | 60 |
| 5 | Business | 754 | 350 | 18 | 22 |
+----------+-------------+---------+--------+-----+-------+
Expected output
+----------+-------------+---------+--------+-----+-------+
| Empl. No | Dept | Pincode | Salary | Age | Bonus |
+----------+-------------+---------+--------+-----+-------+
| 1 | HR | 111 | 100 | 80 | 80 |
| 2 | Sales | 596 | 40 | 60 | 40 |
| 3 | Manufacture | 895 | 60 | 100 | 100 |
| 4 | HR | 212 | 80 | 40 | 60 |
| 5 | Business | 754 | 20 | 20 | 20 |
+----------+-------------+---------+--------+-----+-------+
The formula for percentile for a given element 'x' in the list = (Number of elements less than 'x'/Total number of elements) *100.
You can use percentile_approx for this, in conjunction with groupBy on the columns for which you want the percentile to be calculated.
It is built into Spark 3.1 and later.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

input_list = [
    (1,"HR",111,1000,45,100)
    ,(2,"Sales",112,500,30,50)
    ,(3,"Manufacture",127,600,50,500)
    ,(4,"Hr",821,700,26,60)
    ,(5,"Business",754,350,18,22)
]

sparkDF = spark.createDataFrame(input_list, ['emp_no','dept','pincode','salary','age','bonus'])
sparkDF.groupBy(['emp_no','dept']).agg(
*[ F.first(F.col('pincode')).alias('pincode') ]
,*[ F.percentile_approx(F.col(col),0.95).alias(col) for col in ['salary','age','bonus'] ]
).show()
+------+-----------+-------+------+---+-----+
|emp_no| dept|pincode|salary|age|bonus|
+------+-----------+-------+------+---+-----+
| 3|Manufacture| 127| 600| 50| 500|
| 1| HR| 111| 1000| 45| 100|
| 2| Sales| 112| 500| 30| 50|
| 5| Business| 754| 350| 18| 22|
| 4| Hr| 821| 700| 26| 60|
+------+-----------+-------+------+---+-----+
Spark has a window function for calculating percentiles which is called percent_rank.
Test df:
from pyspark.sql import SparkSession, functions as F, Window as W
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[(1, "HR", 111, 1000, 45, 100),
(2, "Sales", 596, 500, 30, 50),
(3, "Manufacture", 895, 600, 50, 400),
(4, "HR", 212, 700, 26, 60),
(5, "Business", 754, 350, 18, 22)],
['Empl_No', 'Dept', 'Pincode', 'Salary', 'Age', 'Bonus'])
df.show()
# +-------+-----------+-------+------+---+-----+
# |Empl_No| Dept|Pincode|Salary|Age|Bonus|
# +-------+-----------+-------+------+---+-----+
# | 1| HR| 111| 1000| 45| 100|
# | 2| Sales| 596| 500| 30| 50|
# | 3|Manufacture| 895| 600| 50| 400|
# | 4| HR| 212| 700| 26| 60|
# | 5| Business| 754| 350| 18| 22|
# +-------+-----------+-------+------+---+-----+
percent_rank works in a way that the smallest value gets percentile 0 and the biggest value gets 1: it computes (rank - 1) / (number of rows - 1).
arr = ['Salary', 'Age', 'Bonus']
df = df.select(
*[c for c in df.columns if c not in arr],
*[F.percent_rank().over(W.orderBy(c)).alias(c) for c in arr]
).sort('Empl_No')
df.show()
# +-------+-----------+-------+------+----+-----+
# |Empl_No| Dept|Pincode|Salary| Age|Bonus|
# +-------+-----------+-------+------+----+-----+
# | 1| HR| 111| 1.0|0.75| 0.75|
# | 2| Sales| 596| 0.25| 0.5| 0.25|
# | 3|Manufacture| 895| 0.5| 1.0| 1.0|
# | 4| HR| 212| 0.75|0.25| 0.5|
# | 5| Business| 754| 0.0| 0.0| 0.0|
# +-------+-----------+-------+------+----+-----+
However, your expectation is somewhat different. You expect it to treat 0 as the smallest value even though it does not exist in the columns.
To solve this, I will add a row of 0 values and delete it afterwards.
arr = ['Salary', 'Age', 'Bonus']
# Adding a row containing 0 values
df = df.limit(1).withColumn('Dept', F.lit('_tmp')).select(
*[c for c in df.columns if c not in arr],
*[F.lit(0).alias(c) for c in arr]
).union(df)
# Calculating percentiles
df = df.select(
*[c for c in df.columns if c not in arr],
*[F.percent_rank().over(W.orderBy(c)).alias(c) for c in arr]
).sort('Empl_No')
# Removing the fake row
df = df.filter("Dept != '_tmp'")
df.show()
# +-------+-----------+-------+------+---+-----+
# |Empl_No| Dept|Pincode|Salary|Age|Bonus|
# +-------+-----------+-------+------+---+-----+
# | 1| HR| 111| 1.0|0.8| 0.8|
# | 2| Sales| 596| 0.4|0.6| 0.4|
# | 3|Manufacture| 895| 0.6|1.0| 1.0|
# | 4| HR| 212| 0.8|0.4| 0.6|
# | 5| Business| 754| 0.2|0.2| 0.2|
# +-------+-----------+-------+------+---+-----+
You can multiply the percentile by 100 if you like:
*[(100 * F.percent_rank().over(W.orderBy(c))).alias(c) for c in arr]
Then you get...
+-------+-----------+-------+------+-----+-----+
|Empl_No| Dept|Pincode|Salary| Age|Bonus|
+-------+-----------+-------+------+-----+-----+
| 1| HR| 111| 100.0| 80.0| 80.0|
| 2| Sales| 596| 40.0| 60.0| 40.0|
| 3|Manufacture| 895| 60.0|100.0|100.0|
| 4| HR| 212| 80.0| 40.0| 60.0|
| 5| Business| 754| 20.0| 20.0| 20.0|
+-------+-----------+-------+------+-----+-----+
I have a pyspark dataframe:
date | cust | amount | is_delinquent
---------------------------------------
1/1/20 | A | 5 | 0
13/1/20 | A | 1 | 0
15/1/20 | A | 3 | 1
19/1/20 | A | 4 | 0
20/1/20 | A | 4 | 1
27/1/20 | A | 2 | 0
1/2/20 | A | 2 | 0
5/2/20 | A | 1 | 0
1/1/20 | B | 7 | 0
| 9/1/20 | B | 5 | 0
Now I want to calculate the average of amount over a rolling window of 30 days, considering only the rows where IS_DELINQUENT is equal to 0. When IS_DELINQUENT is equal to 1, it should be skipped and the average replaced with NaN.
My expected final dataframe is:
date | cust | amount | is_delinquent | avg_amount
----------------------------------------------------------
1/1/20 | A | 5 | 0 | null
13/1/20 | A | 1 | 0 | 5
15/1/20 | A | 3 | 1 | null
19/1/20 | A | 4 | 0 | 3
20/1/20 | A | 4 | 1 | null
27/1/20 | A | 2 | 0 | 3.333
1/2/20 | A | 2 | 0 | null
5/2/20 | A | 1 | 0 | 2
1/1/20 | B | 7 | 0 | null
9/1/20 | B | 5 | 0 | 7
Without the filtering, my code would be like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
days = lambda i: i * 86400
w_pay_30x = Window.partitionBy("cust").orderBy(F.col("date").cast("timestamp").cast("long")).rangeBetween(-days(30), -days(1))
data.withColumn("avg_amount", F.avg("amount").over(w_pay_30x))
Any idea how I can add this filter?
You can use when to calculate and show the average only if is_delinquent is equal to 0. Also, you may want to include the month in the partitionBy clause of the window.
from pyspark.sql import functions as F, Window
days = lambda i: i * 86400
w_pay_30x = (Window.partitionBy("cust", F.month(F.to_timestamp('date', 'd/M/yy')))
.orderBy(F.to_timestamp('date', 'd/M/yy').cast('long'))
.rangeBetween(-days(30), -days(1))
)
data2 = data.withColumn(
'avg_amount',
F.when(
F.col('is_delinquent') == 0,
F.avg(
F.when(
F.col('is_delinquent') == 0,
F.col('amount')
)
).over(w_pay_30x)
)
).orderBy('cust', F.to_timestamp('date', 'd/M/yy'))
data2.show()
+-------+----+------+-------------+------------------+
| date|cust|amount|is_delinquent| avg_amount|
+-------+----+------+-------------+------------------+
| 1/1/20| A| 5| 0| null|
|13/1/20| A| 1| 0| 5.0|
|15/1/20| A| 3| 1| null|
|19/1/20| A| 4| 0| 3.0|
|20/1/20| A| 4| 1| null|
|27/1/20| A| 2| 0|3.3333333333333335|
| 1/2/20| A| 2| 0| null|
| 5/2/20| A| 1| 0| 2.0|
| 1/1/20| B| 7| 0| null|
| 9/1/20| B| 5| 0| 7.0|
+-------+----+------+-------------+------------------+
Assume we have a spark DataFrame that looks like the following (ordered by time):
+------+-------+
| time | value |
+------+-------+
| 1 | A |
| 2 | A |
| 3 | A |
| 4 | B |
| 5 | B |
| 6 | A |
+------+-------+
I'd like to calculate the start/end times of each sequence of uninterrupted values. The expected output from the above DataFrame would be:
+-------+-------+-----+
| value | start | end |
+-------+-------+-----+
| A | 1 | 3 |
| B | 4 | 5 |
| A | 6 | 6 |
+-------+-------+-----+
(The end value for the final row could also be null.)
Doing this with a simple group aggregation:
.groupBy("value")
.agg(
F.min("time").alias("start"),
F.max("time").alias("end")
)
doesn't take into account the fact that the same value can appear in multiple different intervals.
The idea is to create an identifier for each consecutive group and use it to group by and compute your min and max time.
Assuming df is your dataframe:
from pyspark.sql import functions as F, Window
# fg is 1 whenever the value differs from the previous row's value, 0 otherwise
df = df.withColumn(
    "fg",
    F.when(
        F.lag('value').over(Window.orderBy("time"))==F.col("value"),
        0
    ).otherwise(1)
)
# rn is the running sum of fg: a group id shared by each run of consecutive identical values
df = df.withColumn(
    "rn",
    F.sum("fg").over(
        Window
        .orderBy("time")
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
)
From that point, you have your dataframe with an identifier for each consecutive group.
df.show()
+----+-----+---+---+
|time|value| rn| fg|
+----+-----+---+---+
| 1| A| 1| 1|
| 2| A| 1| 0|
| 3| A| 1| 0|
| 4| B| 2| 1|
| 5| B| 2| 0|
| 6| A| 3| 1|
+----+-----+---+---+
Then you just have to do the aggregation:
df.groupBy(
'value',
"rn"
).agg(
F.min('time').alias("start"),
F.max('time').alias("end")
).drop("rn").show()
+-----+-----+---+
|value|start|end|
+-----+-----+---+
| A| 1| 3|
| B| 4| 5|
| A| 6| 6|
+-----+-----+---+
I have a pyspark dataframe with the following data:
| y | date | amount| id |
-----------------------------
| 1 | 2017-01-01 | 10 | 1 |
| 0 | 2017-01-01 | 2 | 1 |
| 1 | 2017-01-02 | 20 | 1 |
| 0 | 2017-01-02 | 3 | 1 |
| 1 | 2017-01-03 | 2 | 1 |
| 0 | 2017-01-03 | 5 | 1 |
I want to apply a window function, but apply the sum aggregate only to the rows with y==1, while still keeping the other rows.
The window that I would apply is:
w = Window \
.partitionBy(df.id) \
.orderBy(df.date.asc()) \
.rowsBetween(Window.unboundedPreceding, -1)
And the resulting dataframe would be:
| y | date | amount| id | sum |
-----------------------------------
| 1 | 2017-01-01 | 10 | 1 | 0 |
| 0 | 2017-01-01 | 2 | 1 | 0 |
| 1 | 2017-01-02 | 20 | 1 | 10 | // =10 (considering only the row with y==1)
| 0 | 2017-01-02 | 3 | 1 | 10 | // same as above
| 1 | 2017-01-03 | 2 | 1 | 30 | // =10+20
| 0 | 2017-01-03 | 5 | 1 | 30 | // same as above
Is this feasible anyhow?
I tried to use sum(when(df.y==1, df.amount)).over(w) but it didn't return the correct results.
Actually, it is difficult to handle this with a single window function. I think you should create some dummy columns first to calculate the sum column. You can find my solution below.
>>> from pyspark.sql.window import Window
>>> import pyspark.sql.functions as F
>>>
>>> df.show()
+---+----------+------+---+
| y| date|amount| id|
+---+----------+------+---+
| 1|2017-01-01| 10| 1|
| 0|2017-01-01| 2| 1|
| 1|2017-01-02| 20| 1|
| 0|2017-01-02| 3| 1|
| 1|2017-01-03| 2| 1|
| 0|2017-01-03| 5| 1|
+---+----------+------+---+
>>>
>>> # c1: keep amount only for rows with y==1, otherwise 0
>>> df = df.withColumn('c1', F.when(F.col('y')==1,F.col('amount')).otherwise(0))
>>>
>>> # c2: cumulative sum of c1 over all strictly preceding rows
>>> window1 = Window.partitionBy(df.id).orderBy(df.date.asc()).rowsBetween(Window.unboundedPreceding, -1)
>>> df = df.withColumn('c2', F.sum(df.c1).over(window1)).fillna(0)
>>>
>>> # c3: the previous row's value of c2
>>> window2 = Window.partitionBy(df.id).orderBy(df.date.asc())
>>> df = df.withColumn('c3', F.lag(df.c2).over(window2)).fillna(0)
>>>
>>> # y==0 rows take the lagged sum c3, y==1 rows take c2
>>> df = df.withColumn('sum', F.when(df.y==0,df.c3).otherwise(df.c2))
>>>
>>> df = df.select('y','date','amount','id','sum')
>>>
>>> df.show()
+---+----------+------+---+---+
| y| date|amount| id|sum|
+---+----------+------+---+---+
| 1|2017-01-01| 10| 1| 0|
| 0|2017-01-01| 2| 1| 0|
| 1|2017-01-02| 20| 1| 10|
| 0|2017-01-02| 3| 1| 10|
| 1|2017-01-03| 2| 1| 30|
| 0|2017-01-03| 5| 1| 30|
+---+----------+------+---+---+
Note that this solution may not work if there are multiple y=1 or y=0 rows per day, so please keep that in mind.
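If that caveat matters for your data, a different route (my own sketch, not part of the solution above) is to make the window frame range-based on the date itself, so that all rows of the same day are excluded from the frame together, and to keep only the y==1 amounts inside the sum:
from pyspark.sql import functions as F, Window

one_day = 86400  # seconds in one day

# Range frame over the date: everything up to and including the previous day
w = (Window.partitionBy("id")
     .orderBy(F.col("date").cast("timestamp").cast("long"))
     .rangeBetween(Window.unboundedPreceding, -one_day))

df.withColumn(
    "sum",
    F.coalesce(F.sum(F.when(F.col("y") == 1, F.col("amount"))).over(w), F.lit(0))
).show()
On the sample data this reproduces the expected sum column; since the frame is based on the date value, it does not depend on how rows within the same day are ordered.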
I have a dataframe in spark:
id | itemid | itemquant | itemprice
-------------------------------------------------
A | 1,2,3 | 2,2,1 | 30,19,10
B | 3,5 | 5,8 | 18,40
Here all the columns are of string datatype.
How can I use the explode function across multiple columns and create a new dataframe as shown below:
id | itemid | itemquant | itemprice
-------------------------------------------------
A | 1 | 2 | 30
A | 2 | 2 | 19
A | 3 | 1 | 10
B | 3 | 5 | 18
B | 5 | 8 | 40
Here in the new dataframe also, all the columns are of string datatype.
You need a UDF for that:
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._
val df = Seq(
("A","1,2,3","2,2,1","30,19,10"),
("B","3,5","5,8","18,40")
).toDF("id","itemid","itemquant","itemprice")
// split each string on ',' and zip the three arrays element-wise into (itemid, itemquant, itemprice) triples
val splitAndZip = udf((col1:String,col2:String,col3:String) => {
col1.split(',').zip(col2.split(',')).zip(col3.split(',')).map{case ((a,b),c) => (a,b,c)}
})
df
.withColumn("tmp",explode(splitAndZip($"itemid",$"itemquant",$"itemprice")))
.select(
$"id",
$"tmp._1".as("itemid"),
$"tmp._2".as("itemquant"),
$"tmp._3".as("itemprice")
)
.show()
+---+------+---------+---------+
| id|itemid|itemquant|itemprice|
+---+------+---------+---------+
| A| 1| 2| 30|
| A| 2| 2| 19|
| A| 3| 1| 10|
| B| 3| 5| 18|
| B| 5| 8| 40|
+---+------+---------+---------+