I have data that looks like this

Name  Amount  Start     End
A     $1      9/1/22    10/31/22
B     $3      10/15/22  12/2/22
C     $4      9/18/22   9/30/22
I would like to spread each amount over the number of months between Start and End, and then take the final aggregate. So I would like the result to look like the following
Sept  Oct   Nov  Dec
$4.5  $1.5  $1   $1
A: $1 would be spread over September and October ($0.5 each)
B: $3 would be spread over 3 months October, November & December ($1 each) (Yes, December counts as a full month, should be blind to the day)
C: $4 Would all land in September
Bonus 1:
How can I aggregate by Quarter?
Bonus 2: Is there a way I can weight the spread even further? For example: have the value spread over the days and then aggregated. Take customer B: we would spread the $3 over 47 days - 15 days in October, 30 days in November & 2 days in December. That would look like
Oct         Nov         Dec
$3x(15/47)  $3x(30/47)  $3x(2/47)
This solution will use a package called staircase, which is part of the pandas ecosystem and designed to work with (mathematical) step functions. Any time your data deals with "starts" and "ends", you can ask yourself whether it represents step functions.
setup
Create the dataframe (Name column seems irrelevant) and make sure dates are pandas.Timestamp
import pandas as pd

df = pd.DataFrame(
    {
        "Amount": [1, 3, 4],
        "Start": ["2022-09-01", "2022-10-15", "2022-09-18"],
        "End": ["2022-10-31", "2022-12-02", "2022-09-30"],
    }
)
df[["Start", "End"]] = df[["Start", "End"]].apply(pd.to_datetime)
solution
We'll go straight to "Bonus 2".
Create a step function using the staircase.Stairs object - it represents a step function and is to staircase what pandas.Series is to pandas.
import staircase as sc
sf = sc.Stairs(frame=df, start="Start", end="End", value="Amount")
sf will increase at points given by the "Start" column, and decrease at points given by the "End" column. The size of the increase/decrease is given by the "Amount" column.
You can even plot your step function to look at it
sf.plot()
Now create "cuts" for your monthly buckets
months = pd.period_range("2022-09", "2022-12", freq="M")
cuts = months.union([months[-1]+1]).start_time
cuts is a pandas.DatetimeIndex and looks like this
DatetimeIndex(['2022-09-01', '2022-10-01', '2022-11-01', '2022-12-01',
'2023-01-01'],
dtype='datetime64[ns]', freq='MS')
Then slice the step function with these cuts and use the mean function, which will give the average value of the step function in each bucket - this is what "spreads" the amounts out
sf.slice(cuts).mean()
The result is a pandas.Series indexed by your monthly intervals
[2022-09-01, 2022-10-01) 2.600000
[2022-10-01, 2022-11-01) 2.612903
[2022-11-01, 2022-12-01) 3.000000
[2022-12-01, 2023-01-01) 0.096774
dtype: float64
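These numbers can be sanity-checked by hand. For the September bucket: A contributes 1 for all 30 days, and C contributes 4 for the 12 days from 9/18 to 9/30:

```python
# September bucket mean, computed by hand:
# A: value 1 for all 30 days; C: value 4 for 12 days (9/18 to 9/30)
sept_mean = (1 * 30 + 4 * 12) / 30
print(sept_mean)  # 2.6
```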
If you want to aggregate by quarter, then define your cuts to be the points which define quarters - the above approach is very flexible.
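For example, quarterly cuts can be built exactly like the monthly ones, just with a quarterly period range (this is plain pandas; the resulting cuts are then passed to sf.slice(cuts).mean() as before):

```python
import pandas as pd

# Quarter boundaries covering 2022Q3-2022Q4, plus the start of the next quarter
quarters = pd.period_range("2022Q3", "2022Q4", freq="Q")
cuts = quarters.union([quarters[-1] + 1]).start_time
print(cuts)  # 2022-07-01, 2022-10-01, 2023-01-01
```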
note: I am the creator of staircase, and happy to answer any questions you may have.
I have a database that contains all flight data for 2019. I want to plot a time series where the y-axis is the number of flights that are delayed ('DEP_DELAY_NEW') and the x-axis is the day of the week.
The day of the week column is an integer, i.e. 1 is Monday, 2 is Tuesday etc.
# only select delayed flights
delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] >0]
delayed_flights['DAY_OF_WEEK'].value_counts()
1 44787
7 40678
2 33145
5 29629
4 27991
3 26499
6 24847
Name: DAY_OF_WEEK, dtype: int64
How do I convert the above into a time series? Additionally, how do I change the integer for the 'day of week' into a string (i.e. 'Monday' instead of '1')? I couldn't find the answer to those questions in this forum. Thank you
Let's break down the problem into two parts.
Converting the num_delayed columns into a time series
I am not sure what you meant by a time-series here. But the below code would work well for your plotting purpose.
delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] > 0]
delayed_series = delayed_flights['DAY_OF_WEEK'].value_counts().sort_index()
delayed_df = delayed_series.to_frame('NUM_DELAYS')
delayed_array = delayed_df['NUM_DELAYS'].values
delayed_array contains the array of delayed flight counts in day-of-week order. Note that value_counts alone sorts by count, not by day, hence the sort_index; and to_frame is used rather than pd.DataFrame(..., columns=['NUM_DELAYS']), because passing a named Series with a different columns= label would produce an empty column.
Converting the day integer into a weekday name
You can easily do this by using the calendar module.
>>> import calendar
>>> calendar.day_name[0]
'Monday'
Note that calendar.day_name always runs Monday (index 0) through Sunday (index 6); calendar.setfirstweekday changes how the module's calendar-formatting functions display weeks, but does not reorder day_name.
In your case, your day integers are 1-indexed and hence you would need to subtract 1 to make it 0-indexed. Another easy workaround would be to define a dictionary with keys as day_int and values as weekday.
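For example, combining both ideas, assuming the 1-indexed day convention from the question:

```python
import calendar

# Map 1-indexed day numbers (1=Monday, ..., 7=Sunday) to weekday names
day_map = {i + 1: calendar.day_name[i] for i in range(7)}
print(day_map[1], day_map[7])  # Monday Sunday

# Then, on the original frame (column names from the question):
# delayed_flights['DAY_NAME'] = delayed_flights['DAY_OF_WEEK'].map(day_map)
```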
I'm new to Stackoverflow and fairly fresh with Python (some 5 months give or take), so apologies if I'm not explaining this too clearly!
I want to build up a historic trend of the average age of outstanding incidents on a daily basis.
I have two dataframes.
df1 contains incident data going back 8 years, with the two most relevant columns being "opened_at" and "resolved_at" which contains datetime values.
df2 contains a column called date with the full date range from 2012-06-13 to now.
The goal is to have df2 contain the number of outstanding incidents on each date (as of 00:00:00) and the average age of all those deemed outstanding.
I know it's possible to get all rows that exist between two dates, but I believe I want the opposite: for each date row in df2, find the rows of df1 where that date falls between opened_at and resolved_at
(It would be helpful to have some example code containing an anonymized/randomized short extract of your data to try solutions on)
This is unlikely to be the most efficient solution, but I believe you could do:
df2["incs_open"] = 0  # Ensure the column exists

for row_num in range(df2.shape[0]):
    df2.at[row_num, "incs_open"] = sum(
        (df1["opened_at"] < df2.at[row_num, "date"]) &
        (df2.at[row_num, "date"] < df1["resolved_at"])
    )
(This assumes you haven't set an index on the data frame other than the default one)
For the second part of your question, the average age, I believe you can calculate that in the body of the loop like this:
open_incs = (df1["opened_at"] < df2.at[row_num, "date"]) & \
            (df2.at[row_num, "date"] < df1["resolved_at"])
ages_of_open_incs = df2.at[row_num, "date"] - df1.loc[open_incs, "opened_at"]
avg_age = ages_of_open_incs.mean()
You'll hit some weirdness about rounding and things. If an incident was opened last night at 3am, what is its age "today": 1 day, 0 days, 9 hours (if we take noon as the point to count from), etc.? But I assume once you've got code that works you can adjust that to taste.
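Since no sample data was provided, here is a minimal self-contained sketch of the whole loop on made-up incidents (the opened_at/resolved_at column names follow the question; the dates are invented):

```python
import pandas as pd

# Made-up incident data for illustration only
df1 = pd.DataFrame({
    "opened_at":   pd.to_datetime(["2022-01-01", "2022-01-03", "2022-01-05"]),
    "resolved_at": pd.to_datetime(["2022-01-04", "2022-01-10", "2022-01-06"]),
})
df2 = pd.DataFrame({"date": pd.date_range("2022-01-01", "2022-01-07")})

df2["incs_open"] = 0
for row_num in range(df2.shape[0]):
    # Incidents already opened and not yet resolved at this date
    open_incs = (df1["opened_at"] < df2.at[row_num, "date"]) & \
                (df2.at[row_num, "date"] < df1["resolved_at"])
    df2.at[row_num, "incs_open"] = open_incs.sum()

print(df2)
```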
I have this dataframe:
date     amount
2018/01  100
2018/02  105
2018/03  110.25
2018/04  200
As you can see, every month the amount increases by 5% of the previous value. However, every 4th month (2018/04), this rule does not apply. Instead, it should just use a constant value of 200, for example.
How do I program this in pandas dataframe?
#Lroy_12374 It's not clear what would happen in months 5-8 and beyond, which would affect how to write the logic. For example:
a) Should month 5 be 5% higher than month 3? OR
b) should it be 5% higher than every fourth month (i.e. April 2018, August 2018, December 2018, April 2019, August 2019, December 2019, etc.)? OR
c) Should it be 5% higher than Month 4 had month 4 not been a constant, which means that Month 5 is 1.05^2*(Month 3).
Also, the definition of a constant is not clear. Literally, will it be 200 or something for every fourth month? Or, will it be a different number that does not follow the pattern of the other 3 months.
I have written some code for scenario c) above:
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': ['2018/01', '2018/02', '2018/03', '2018/04',
                            '2018/05', '2018/06', '2018/07', '2018/08']})

start_amount = 100
constant = 200
growth = .05

df['amount'] = np.where((df.index + 1) % 4 != 0,
                        start_amount * (1 + growth) ** df.index, constant)
df
The key here is to use np.where and implement logic based on the row number, which you can get with df.index. What I am doing in the code above is adding 1 to the row (df.index+1), since python starts counting at 0 and you want logic based on the fourth month. Then I am using the % symbol, which returns the remainder after dividing; you want this to equal zero on every fourth row (i.e. 4/4 = remainder 0). So, wherever it is not a fourth row, you multiply by 1.05 (5% increase) raised to the power of the row number, and where it is a fourth row you return the constant.
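The modulo part can be checked on its own; every fourth row (0-indexed rows 3, 7, ...) fails the (index+1) % 4 != 0 test and therefore gets the constant:

```python
import numpy as np

idx = np.arange(8)         # row numbers 0..7
mask = (idx + 1) % 4 != 0  # False exactly at rows 3 and 7 (the 4th and 8th months)
print(mask.tolist())
```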
I hope this helps.
I'm looking to perform data analysis on 100-years of climatological data for select U.S. locations (8 in particular), for each day spanning the 100-years. I have a pandas dataFrame set up with columns for Max temperature, Min temperature, Avg temperature, Snowfall, Precip Total, and then Day, Year, and Month values (then, I have an index also based on a date-time value). Right now, I want to set up a for loop to print the first Maximum temperature of 90 degrees F or greater from each year, but ONLY the first. Eventually, I want to narrow this down to each of my 8 locations, but first I just want to get the for loop to work.
Experimented with various iterations of a for loop.
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break
Unsurprisingly, the output of the loop I provided prints the first 90 degree day period (from the year 1919, the beginning of my data frame) and breaks.
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break

1919-06-12 00:00:00
That's fine. If I take out the break statement, all of the 90 degree days print, including multiple in the same year. I just want the first value from each year to print. Do I need to set up a second for loop to increment through the years? If I explicitly state the year, as below, while trying to loop through a counter, the loop still begins in 1919 and eventually reaches an out-of-bounds index. I know this logic is incorrect.
count = 1919
while count < 2019:
    for year in range(len(climate['Year'])):
        if (climate[climate['Year'] == count]['Max'][year] >= 90).all():
            print(climate.index[year])
    count = count + 1
Any input is sincerely appreciated.
You can achieve this without having a second for loop. Assuming the climate dataframe is ordered chronologically, this should do what you want:
current_year = None
for i in range(climate.shape[0]):
    if climate['Max'][i] >= 90 and climate['Year'][i] != current_year:
        print(climate.index[i])
        current_year = climate['Year'][i]
Notice that we're using the current_year variable to keep track of the latest year that we've already printed the result for. Then, in the if check, we're checking if we've already printed a result for the year of the current row in the loop.
That's one way to do it, but I would suggest taking a look at pandas.DataFrame.groupby because I think it fits your use case well. You could get a dataframe that contains the first >=90 max days per year with the following (again assuming climate is ordered chronologically):
climate[climate.Max >= 90].groupby('Year').first()
This just filters the dataframe to only contain the >=90 max days, groups rows from the same year together, and retains only the first row from each group. If you had an additional column Location, you could extend this to get the same except per location per year:
climate[climate.Max >= 90].groupby(['Location', 'Year']).first()
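A minimal sketch of the groupby approach on made-up data (the Year/Max columns follow the question; the values and dates are invented):

```python
import pandas as pd

# Tiny stand-in for the climate frame described above
climate = pd.DataFrame(
    {
        "Year": [1919, 1919, 1919, 1920, 1920],
        "Max":  [85, 92, 95, 91, 96],
    },
    index=pd.to_datetime(
        ["1919-06-10", "1919-06-12", "1919-07-01", "1920-05-30", "1920-06-15"]
    ),
)

# First >=90 row per year (the 92 and 91 rows here)
first_hot = climate[climate.Max >= 90].groupby("Year").first()
print(first_hot)
```

One thing to note: .first() returns a frame indexed by Year, so the date index is dropped; if you also need the date of each first row, `.groupby('Year').head(1)` keeps the original index instead.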
I have the following data in a Pandas df:
index;Aircraft_Registration;issue;Leg_Number;Departure_Time;Departure_Date;Arrival_Time;Arrival_Date;Departure_Airport;Arrival_Airport
0;XXA;0;QQ464;01:07:00;2013-12-01;03:33:00;2013-12-01;JFK;AMS
1;XXA;0;QQQ445;06:08:00;2013-12-01;12:02:00;2013-12-01;AMS;CPT
2;XXA;0;QQQ446;13:04:00;2013-12-01;13:13:00;2013-12-01;JFK;SID
3;XXA;0;QQ446;14:17:00;2013-12-01;20:15:00;2013-12-01;SID;FRA
4;XXA;0;QQ453;02:02:00;2013-12-02;13:09:00;2013-12-02;JFK;BJL
5;XXA;0;QQ150;05:47:00;2018-12-03;12:37:00;2018-03-03;KAO;AMS
6;XXA;0;QQ457;15:09:00;2018-11-03;17:51:00;2018-03-03;AMS;AGP
7;XXA;0;QQ457;08:34:00;2018-12-03;22:47:00;2018-03-03;AGP;JFK
8;XXA;0;QQ458;03:34:00;2018-12-03;23:59:00;2018-03-03;ATL;BJL
9;XXA;0;QQ458;06:26:00;2018-10-04;07:01:00;2018-03-04;BJL;AMS
I want to group this data on the month, ignoring the year, so I would ideally end up with 12 new dataframes, each representing the events of that month across all years.
I tried the following:
sort = list(df.groupby(pd.Grouper(freq='M', key='Departure_Date')))
This results in a list containing a data frame for each month and year combination, in this case yielding 60 entries, many of which are empty since there is no data for that month.
My expected result is a list containing 12 dataframes, one for each month (January, February, etc.)
I think you need dt.month for months 1-12, or dt.strftime for January-December:
sort = list(df.groupby(df['Departure_Date'].dt.month))
Or:
sort = list(df.groupby(df['Departure_Date'].dt.strftime('%B')))
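A small self-contained sketch on a made-up subset of the data above (Departure_Date must be converted to datetime first, since it is read as text; collecting the groups into a dict keyed by month number is often handier than a list):

```python
import pandas as pd

df = pd.DataFrame({
    "Leg_Number": ["QQ464", "QQ453", "QQ150", "QQ457", "QQ458"],
    "Departure_Date": ["2013-12-01", "2013-12-02", "2018-12-03",
                       "2018-11-03", "2018-10-04"],
})
df["Departure_Date"] = pd.to_datetime(df["Departure_Date"])

# One dataframe per month number present in the data, year ignored
by_month = dict(list(df.groupby(df["Departure_Date"].dt.month)))
print(sorted(by_month))  # months present in this subset: 10, 11, 12
```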