% difference over columns in PySpark for each row - apache-spark

I am trying to compute the percentage difference over columns for each row in a dataframe. Here is my dataset:
For example, for the first row, I am trying to get the variation rate of 2016 compared to 2015, of 2017 compared to 2016... Only 2015 and 2019 should be removed, so that there will be 5 columns at the end.
I know that window and lag can help achieve this, but I have been unsuccessful so far.

No window functions should be needed. You just need to calculate the % change by arithmetic operations on the columns, if I understood the question correctly.
import pyspark.sql.functions as F
df2 = df.select(
    'city', 'postal_code',
    *[((F.col(str(year)) - F.col(str(year - 1))) / F.col(str(year - 1))).alias('percent_change_%s' % year)
      for year in [2016, 2017, 2018, 2019]]
)
Also, I don't understand why you want 5 columns at the end. Shouldn't it be 6? Why is 2019 removed? You can calculate the % change for 2019 as (2019 - 2018) / 2018, for instance.
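For illustration, here is a minimal, hypothetical example of the approach above (the city/postal_code values and the yearly numbers are made up, since the original dataset was not shown):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# made-up data with the column layout the question implies
df = spark.createDataFrame(
    [('Paris', '75001', 100.0, 110.0, 121.0, 133.1, 146.4)],
    ['city', 'postal_code', '2015', '2016', '2017', '2018', '2019'],
)

df2 = df.select(
    'city', 'postal_code',
    *[((F.col(str(year)) - F.col(str(year - 1))) / F.col(str(year - 1))).alias('percent_change_%s' % year)
      for year in [2016, 2017, 2018, 2019]]
)
df2.show()
# each percent_change_* column comes out at roughly 0.10, i.e. a 10% year-on-year increase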

Related

Spread Amount Over Months Between Two Dates

I have data that looks like this:

Name  Amount  Start     End
A     $1      9/1/22    10/31/22
B     $3      10/15/22  12/2/22
C     $4      9/18/22   9/30/22
I would like to spread each amount over the number of months between its start and end dates and then take the final aggregate. So I would like the result to look like the following:

Sept  Oct   Nov  Dec
$4.5  $1.5  $1   $1
A: $1 would be spread over September and October ($0.5 each)
B: $3 would be spread over 3 months October, November & December ($1 each) (Yes, December counts as a full month, should be blind to the day)
C: $4 Would all land in September
Bonus 1:
How can I aggregate by Quarter?
Bonus 2: Is there a way in which I can weight the spread even further? For example: have the value spread over the days and then aggregated. Take customer B for example: we would spread the $3 over 47 days - 15 days in October, 30 days in November & 2 days in December. That would look like:
Oct         Nov         Dec
$3x(15/47)  $3x(30/47)  $3x(2/47)
This solution will use a package called staircase which is part of the pandas ecosystem and designed to work with (mathematical) step functions. Any time your data is dealing with "starts" and "ends" you can ask yourself whether your data is representing step functions.
setup
Create the dataframe (Name column seems irrelevant) and make sure dates are pandas.Timestamp
import pandas as pd
df = pd.DataFrame(
    {
        "Amount": [1, 3, 4],
        "Start": ["2022-09-01", "2022-10-15", "2022-09-18"],
        "End": ["2022-10-31", "2022-12-02", "2022-09-30"],
    }
)
df[["Start", "End"]] = df[["Start", "End"]].apply(pd.to_datetime)
solution
We'll go straight to "Bonus 2".
Create a step function using the staircase.Stairs object - it represents a step function and is to staircase what pandas.Series is to pandas.
import staircase as sc
sf = sc.Stairs(frame=df, start="Start", end="End", value="Amount")
sf will increase at points given by the "Start" column, and decrease at points given by the "End" column. The increase/decrease will be given by the Amount column.
You can even plot your step function to look at it
sf.plot()
Now create "cuts" for your monthly buckets
months = pd.period_range("2022-09", "2022-12", freq="M")
cuts = months.union([months[-1]+1]).start_time
cuts is a pandas.DatetimeIndex and looks like this
DatetimeIndex(['2022-09-01', '2022-10-01', '2022-11-01', '2022-12-01',
               '2023-01-01'],
              dtype='datetime64[ns]', freq='MS')
Then slice the step function up into these buckets and use the mean function, which will give the average value of the step function in each bucket - this is "spreading" it out
sf.slice(cuts).mean()
The result is a pandas.Series indexed by your monthly intervals
[2022-09-01, 2022-10-01) 2.600000
[2022-10-01, 2022-11-01) 2.612903
[2022-11-01, 2022-12-01) 3.000000
[2022-12-01, 2023-01-01) 0.096774
dtype: float64
If you want to aggregate by quarter, then define your cuts to be the points which define quarters - the above approach is very flexible.
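For example, a minimal sketch of quarterly buckets, reusing sf and the same slicing approach (the exact period range is an assumption):

# quarterly cuts instead of monthly ones
quarters = pd.period_range("2022Q3", "2022Q4", freq="Q")
cuts = quarters.union([quarters[-1] + 1]).start_time  # 2022-07-01, 2022-10-01, 2023-01-01
sf.slice(cuts).mean()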
note: I am the creator of staircase, and happy to answer any questions you may have.

calculate average difference between dates using pyspark

I have a data frame that looks like this - user ID and dates of activity. I need to calculate the average difference between dates using RDD functions (such as reduce and map) and not SQL.
The dates for each ID need to be sorted in order before calculating the difference, as I need the difference between each pair of consecutive dates.
ID  Date
1   2020-09-03
1   2020-09-03
2   2020-09-02
1   2020-09-04
2   2020-09-06
2   2020-09-16
The needed outcome for this example will be:

ID  average difference
1   0.5
2   7
thanks for helping!
You can use datediff with a window function to calculate the difference, then take the average.
lag is one of the window functions; it takes a value from the previous row within the window.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# define the window
w = Window.partitionBy('ID').orderBy('Date')

# datediff takes the date difference from the first arg to the second arg (first - second).
(df.withColumn('diff', F.datediff(F.col('Date'), F.lag('Date').over(w)))
   .groupby('ID')  # aggregate over ID
   .agg(F.avg(F.col('diff')).alias('average difference'))
)
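The question asks for RDD functions (map/reduce) rather than SQL, while the answer above uses window functions. As a rough sketch of what an RDD-based version could look like (an assumption, not part of the answer above; it assumes df has 'ID' and 'Date' columns and 'Date' is already a date type):

# gather dates per ID, sort them, take consecutive differences, then average
result = (
    df.rdd
      .map(lambda row: (row['ID'], [row['Date']]))
      .reduceByKey(lambda a, b: a + b)            # all dates for each ID
      .mapValues(sorted)                          # sort each ID's dates
      .mapValues(lambda ds: [(b - a).days for a, b in zip(ds, ds[1:])])
      .mapValues(lambda diffs: sum(diffs) / len(diffs) if diffs else None)
)
result.collect()  # e.g. [(1, 0.5), (2, 7.0)]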

How to build a time series with Matplotlib

I have a database that contains all flights data for 2019. I want to plot a time series where the y-axis is the number of flights that are delayed ('DEP_DELAY_NEW') and the x-axis is the day of the week.
The day of the week column is an integer, i.e. 1 is Monday, 2 is Tuesday etc.
# only select delayed flights
delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] >0]
delayed_flights['DAY_OF_WEEK'].value_counts()
1 44787
7 40678
2 33145
5 29629
4 27991
3 26499
6 24847
Name: DAY_OF_WEEK, dtype: int64
How do I convert the above into a time series? Additionally, how do I change the integer for the 'day of week' into a string (i.e. 'Monday' instead of '1')? I couldn't find the answer to those questions in this forum. Thank you
Let's break down the problem into two parts.
Converting the num_delayed columns into a time series
I am not sure what you meant by a time-series here. But the below code would work well for your plotting purpose.
delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] > 0]
delayed_series = delayed_flights['DAY_OF_WEEK'].value_counts()
delayed_df = pd.DataFrame({'NUM_DELAYS': delayed_series})
delayed_array = delayed_df['NUM_DELAYS'].values
delayed_array contains the array of delayed flight counts in order.
Converting the day in int into a weekday
You can easily do this by using the calendar module.
>>> import calendar
>>> calendar.day_name[0]
'Monday'
If Monday is not the first day of week, you can use setfirstweekday to change it.
In your case, your day integers are 1-indexed and hence you would need to subtract 1 to make it 0-indexed. Another easy workaround would be to define a dictionary with keys as day_int and values as weekday.
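As a minimal sketch putting both parts together (delayed_flights comes from the code above; the bar-plot choice is an assumption):

import calendar
import matplotlib.pyplot as plt

# counts per day of week, ordered Monday (1) to Sunday (7)
counts = delayed_flights['DAY_OF_WEEK'].value_counts().sort_index()

# map the 1-indexed day integers to names (1 -> Monday, ..., 7 -> Sunday)
counts.index = counts.index.map(lambda d: calendar.day_name[d - 1])

counts.plot(kind='bar')  # or kind='line' for a line across the week
plt.ylabel('Number of delayed flights')
plt.show()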

Finding average age of incidents in a datetime series

I'm new to Stackoverflow and fairly fresh with Python (some 5 months give or take), so apologies if I'm not explaining this too clearly!
I want to build up a historic trend of the average age of outstanding incidents on a daily basis.
I have two dataframes.
df1 contains incident data going back 8 years, the two most relevant columns being "opened_at" and "resolved_at", which contain datetime values.
df2 contains a column called date with the full date range from 2012-06-13 to now.
The goal is to have df2 contain the number of outstanding incidents on each date (as of 00:00:00) and the average age of all those deemed outstanding.
I know it's possible to get all rows that exist between two dates, but I believe I want the opposite: to find where each date row in df2 falls between the opened_at and resolved_at dates in df1.
(It would be helpful to have some example code containing an anonymized/randomized short extract of your data to try solutions on)
This is unlikely to be the most efficient solution, but I believe you could do:
df2["incs_open"] = 0 # Ensure the column exists
for row_num in range(df2.shape[0]):
df2.at[row_num, "incs_open"] = sum(
(df1["opened_at"] < df2.at[row_num, "date"]) &
(df2.at[row_num, "date"] < df1["opened_at"])
)
(This assumes you haven't set an index on the data frame other than the default one)
For the second part of your question, the average age, I believe you can calculate that in the body of the loop like this:
open_incs = (df1["opened_at"] < df2.at[row_num, "date"]) & \
            (df2.at[row_num, "date"] < df1["resolved_at"])
ages_of_open_incs = df2.at[row_num, "date"] - df1.loc[open_incs, "opened_at"]
avg_age = ages_of_open_incs.mean()
You'll hit some weirdnesses about rounding and things. If an incident was opened last night at 3am, what is its age "today" -- 1 day, 0 days, 9 hours (if we take noon as the point to count from), etc. -- but I assume once you've got code that works you can adjust that to taste.
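For example, one way to pin down a convention (an assumption, not part of the answer above) is to truncate each age to whole days before averaging:

# assumes ages_of_open_incs is the Timedelta Series computed in the loop above
avg_age_in_whole_days = ages_of_open_incs.dt.days.mean()  # each age floored to whole days
# alternative: keep the exact timedeltas and report fractional days
avg_age_exact = ages_of_open_incs.mean() / pd.Timedelta(days=1)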

How to get cumulative growth in pandas given a growth rate and special rules?

I have this dataframe:
date     amount
2018/01  100
2018/02  105
2018/03  110.25
2018/04  200
As you can see, every month the amount increases by 5% of the previous value. However, every 4th month (2018/04), this rule does not apply. Instead, it should just take a constant value, 200 in this example.
How do I program this in a pandas dataframe?
@Lroy_12374 It's not clear what would happen in months 5-8 and beyond, which would affect how to write the logic. For example:
a) Should month 5 be 5% higher than month 3? OR
b) should it be 5% higher than every fourth month (i.e. April 2018, August 2018, December 2018, April 2019, August 2019, December 2019, etc.)? OR
c) Should it be 5% higher than Month 4 had month 4 not been a constant, which means that Month 5 is 1.05^2*(Month 3).
Also, the definition of a constant is not clear. Will it literally be 200, or something similar, for every fourth month? Or will it be a different number that does not follow the pattern of the other 3 months?
I have written some code for scenario c) above:
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': ['2018/01', '2018/02', '2018/03', '2018/04',
                            '2018/05', '2018/06', '2018/07', '2018/08']})

start_amount = 100
constant = 200
growth = .05

df['amount'] = np.where((df.index + 1) % 4 != 0,
                        start_amount * (1 + growth) ** df.index,
                        constant)
df
The key here is to use np.where and implement logic based on the row number, which you can get with df.index. In the code above I add 1 to the row index (df.index + 1), since Python starts counting at 0 and you want logic based on the fourth month. Then I use the % (modulo) operator, which returns the remainder after dividing; the remainder is zero exactly on every fourth row (e.g. 4 / 4 = remainder 0). So where the row is not a fourth row, the starting amount is multiplied by 1.05 (a 5% increase) raised to the power of the row number, and where it is a fourth row the constant is returned.
I hope this helps.
