Finding average age of incidents in a datetime series - python-3.x

I'm new to Stack Overflow and fairly fresh with Python (some 5 months, give or take), so apologies if I'm not explaining this too clearly!
I want to build up a historic trend of the average age of outstanding incidents on a daily basis.
I have two dataframes.
df1 contains incident data going back 8 years; the two most relevant columns are "opened_at" and "resolved_at", which contain datetime values.
df2 contains a column called date with the full date range from 2012-06-13 to now.
The goal is to have df2 contain the number of outstanding incidents on each date (as of 00:00:00) and the average age of all those deemed outstanding.
I know it's possible to get all rows that exist between two dates, but I believe I want the opposite: for each date row in df2, find the incidents in df1 whose opened_at/resolved_at interval contains that date.

(It would be helpful to have some example code containing an anonymized/randomized short extract of your data to try solutions on)
This is unlikely to be the most efficient solution, but I believe you could do:
df2["incs_open"] = 0 # Ensure the column exists
for row_num in range(df2.shape[0]):
df2.at[row_num, "incs_open"] = sum(
(df1["opened_at"] < df2.at[row_num, "date"]) &
(df2.at[row_num, "date"] < df1["opened_at"])
)
(This assumes you haven't set an index on the data frame other than the default one)
For the second part of your question, the average age, I believe you can calculate that in the body of the loop like this:
open_incs = (df1["opened_at"] < df2.at[row_num, "date"]) & \
            (df2.at[row_num, "date"] < df1["resolved_at"])
ages_of_open_incs = df2.at[row_num, "date"] - df1.loc[open_incs, "opened_at"]
avg_age = ages_of_open_incs.mean()
You'll hit some edge cases around rounding: if an incident was opened last night at 3am, what is its age "today"? 1 day, 0 days, 9 hours (if we take noon as the point to count from)? But I assume once you've got code that works you can adjust that to taste.
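For reference, here is a minimal self-contained version with randomized stand-in data (the column names and date range are assumptions mirroring the question), which you can paste into a fresh session to try the loop end to end:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the incident data, for illustration only
opened = pd.Timestamp("2020-01-01") + pd.to_timedelta(rng.integers(0, 300, 50), unit="D")
resolved = opened + pd.to_timedelta(rng.integers(1, 60, 50), unit="D")
df1 = pd.DataFrame({"opened_at": opened, "resolved_at": resolved})

df2 = pd.DataFrame({"date": pd.date_range("2020-01-01", "2020-12-31")})

df2["incs_open"] = 0
df2["avg_age"] = pd.Timedelta(0)
for row_num in range(df2.shape[0]):
    open_incs = (df1["opened_at"] < df2.at[row_num, "date"]) & \
                (df2.at[row_num, "date"] < df1["resolved_at"])
    df2.at[row_num, "incs_open"] = open_incs.sum()
    # Average age of the incidents still open as of this date (NaT if none)
    df2.at[row_num, "avg_age"] = (df2.at[row_num, "date"] - df1.loc[open_incs, "opened_at"]).mean()

print(df2.head())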

Related

Pandas - Sum of variant number of products

I have two dataframes:
The first one contains the number of daily arrivals of people relative to some particular day of interest, e.g. the '-2' column header refers to the number of people arriving 2 days prior to my day of interest. These arrivals are split between two groups ('Rest of US' and 'International') via a column MultiIndex. Each row of the dataframe shows the expected number of arrivals for a given number of tickets owned. To sum up, a single value of that dataframe is the number of people arriving on a certain day, for a particular group and number of tickets held.
import pandas as pd
import numpy as np
arrivals_US=np.array(
[[0.00000000e+00, 1.32853314e+03, 7.92624282e+03, 2.36446211e+04,
4.70225402e+04, 7.01360136e+04, 8.36885523e+04, 8.32165654e+04,
7.09262060e+04, 5.28946471e+04, 3.50642212e+04, 2.09198795e+04,
1.13464885e+04, 5.64124825e+03, 2.58896895e+03, 1.10330043e+03,
4.38831214e+02, 1.63633610e+02, 5.74273229e+01, 1.90344814e+01,
5.97698863e+00, 1.78298388e+00],
[0.00000000e+00, 5.95406456e+03, 2.65918120e+04, 5.93816593e+04,
8.84026873e+04, 9.87051626e+04, 8.81666331e+04, 6.56277324e+04,
4.18720044e+04, 2.33758901e+04, 1.16000548e+04, 5.18077143e+03,
2.10346912e+03, 7.82869343e+02, 2.68955436e+02, 8.57998806e+01,
2.55464064e+01, 7.13089804e+00, 1.87339647e+00, 4.64827256e-01,
1.09262813e-01, 2.43992667e-02]])
arrivals_Int = np.array(
[[1.80595142e+02, 1.01558052e+03, 2.85556902e+03, 5.35278380e+03,
7.52537259e+03, 8.46381767e+03, 7.93274228e+03, 6.37284860e+03,
4.47973074e+03, 2.79909538e+03, 1.57407707e+03, 8.04714088e+02,
3.77110526e+02, 1.63129911e+02, 6.55260304e+01, 2.45657991e+01,
8.63414246e+00, 2.85613404e+00, 8.92307156e-01, 2.64100407e-01,
7.42587051e-02, 1.98854935e-02],
[3.33606865e+03, 1.20883137e+04, 2.19011273e+04, 2.64530626e+04,
2.39633048e+04, 1.73663061e+04, 1.04878615e+04, 5.42899758e+03,
2.45901062e+03, 9.90030648e+02, 3.58739652e+02, 1.18172777e+02,
3.56834584e+01, 9.94613456e+00, 2.57428744e+00, 6.21865585e-01,
1.40833925e-01, 3.00185083e-02, 6.04292696e-03, 1.15245636e-03,
2.08797472e-04, 3.60277122e-05]])
arrivals = [arrivals_US, arrivals_Int]
days=list(range(-5,17))
tickets=[1,2]
arrivals_df = pd.concat(
    [pd.DataFrame(dict(zip(days, arrival_type.T)), index=tickets)
     for arrival_type in arrivals],
    axis=1, keys=["Rest of US", "International"])
The second one has a similar structure to the first, but instead of the number of arrivals it represents the distribution of the number of days stayed, given the group type and number of tickets owned, e.g. on average 24% of people who are part of the 'Rest of US' group and hold a single ticket will stay for 3 days.
stays_dist_US=np.array(
[[4.43244820e-03, 5.39982734e-02, 2.42003472e-01, 3.98996272e-01,
2.42003472e-01, 5.39982734e-02, 4.43244820e-03, 1.33848338e-04,
1.48692072e-06, 6.07670514e-09, 9.13595666e-12, 5.05295484e-15,
1.02811648e-18, 7.69563998e-23, 2.11910601e-27, 2.14667422e-32,
7.99991029e-38, 1.09675497e-43, 5.53145805e-50, 1.02630195e-56,
7.00513005e-64, 1.75898756e-71],
[1.33830425e-04, 4.43185500e-03, 5.39910468e-02, 2.41971084e-01,
3.98942874e-01, 2.41971084e-01, 5.39910468e-02, 4.43185500e-03,
1.33830425e-04, 1.48672173e-06, 6.07589189e-09, 9.13473400e-12,
5.05227860e-15, 1.02797889e-18, 7.69461007e-23, 2.11882241e-27,
2.14638693e-32, 7.99883965e-38, 1.09660819e-43, 5.53071778e-50,
1.02616460e-56, 7.00419255e-64]])
stays_dist_Int = np.array(
[[5.05227106e-15, 9.13472036e-12, 6.07588282e-09, 1.48671951e-06,
1.33830225e-04, 4.43184839e-03, 5.39909662e-02, 2.41970723e-01,
3.98942278e-01, 2.41970723e-01, 5.39909662e-02, 4.43184839e-03,
1.33830225e-04, 1.48671951e-06, 6.07588282e-09, 9.13472036e-12,
5.05227106e-15, 1.02797735e-18, 7.69459859e-23, 2.11881924e-27,
2.14638372e-32, 7.99882771e-38],
[1.02797735e-18, 5.05227106e-15, 9.13472036e-12, 6.07588282e-09,
1.48671951e-06, 1.33830225e-04, 4.43184839e-03, 5.39909662e-02,
2.41970723e-01, 3.98942278e-01, 2.41970723e-01, 5.39909662e-02,
4.43184839e-03, 1.33830225e-04, 1.48671951e-06, 6.07588282e-09,
9.13472036e-12, 5.05227106e-15, 1.02797735e-18, 7.69459859e-23,
2.11881924e-27, 2.14638372e-32]])
stays_dist = [stays_dist_US, stays_dist_Int]
len_of_stay = list(range(1,23))
tickets=[1,2]
stays_df = pd.concat(
    [pd.DataFrame(dict(zip(len_of_stay, stays_dist_type.T)), index=tickets)
     for stays_dist_type in stays_dist],
    axis=1, keys=["Rest of US", "International"])
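As a quick sanity check: each row of stays_df is a length-of-stay distribution, so the values for a given group and ticket count should sum to roughly 1:

print(stays_df["Rest of US"].sum(axis=1))      # approximately 1.0 for each ticket count
print(stays_df["International"].sum(axis=1))   # approximately 1.0 for each ticket count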
I would like to create a third dataframe depicting the number of departures on each day (with exactly the same column and index structure as the first one) by doing some sort of 'variant sum of products' (not sure what to name it). For example:
the number of 'Rest of US' people with k tickets leaving on day '-4' is the product of the number of such people arriving on day '-5' (value from the 1st dataframe) and the percentage of such people leaving after a single day (value from my length-of-stay distribution, the 2nd dataframe);
the number of 'Rest of US' people with k tickets leaving on day '-3' is the sum of two products:
number of people arriving on day '-5' times the percentage of people staying for 2 days,
number of people arriving on day '-4' times the percentage of people staying for 1 day;
the number of 'Rest of US' people with k tickets leaving on day '-2' is the sum of three products:
number of people arriving on day '-5' times the percentage of people staying for 3 days,
number of people arriving on day '-4' times the percentage of people staying for 2 days,
number of people arriving on day '-3' times the percentage of people staying for 1 day;
and so on, all the way until all days are considered for a given group; then the same is repeated for the other group.
So far I've done it in Excel by manually expanding the formula with each new day I was considering:
[screenshot: example of the calculation in Excel for day '-2']
I tried to replicate it in Python with some for loops but it's messy and probably not efficient.
Is there any clever way one can do this with some pandas/numpy methods?
Thank you for any help and suggestions!
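This 'variant sum of products' is a discrete convolution of each arrivals row with the matching length-of-stay row, so numpy.convolve can do the heavy lifting. A minimal sketch, assuming the arrivals_df, stays_df, days and tickets built above (departures_df and the day arithmetic are my own naming):

import numpy as np

departures_df = arrivals_df * 0.0  # same index/columns as arrivals_df, filled with zeros
for group in ["Rest of US", "International"]:
    for ticket in tickets:
        arr = arrivals_df[group].loc[ticket].to_numpy()   # arrivals on days -5..16
        dist = stays_df[group].loc[ticket].to_numpy()     # P(stay == 1), ..., P(stay == 22)
        conv = np.convolve(arr, dist)
        # Arrivals on day d staying s days depart on day d + s, so conv[n]
        # counts the departures on day n - 4.
        for d in days:
            if d >= -4:  # no one can depart on the earliest arrival day
                departures_df.loc[ticket, (group, d)] = conv[d + 4]

For instance, departures_df.loc[1, ("Rest of US", -3)] should equal the sum of the two products described above for k = 1.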

How to build a time series with Matplotlib

I have a database that contains all flights data for 2019. I want to plot a time series where the y-axis is the number of flights that are delayed ('DEP_DELAY_NEW') and the x-axis is the day of the week.
The day-of-week column is an integer, i.e. 1 is Monday, 2 is Tuesday, etc.
# only select delayed flights
delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] > 0]
delayed_flights['DAY_OF_WEEK'].value_counts()
1 44787
7 40678
2 33145
5 29629
4 27991
3 26499
6 24847
Name: DAY_OF_WEEK, dtype: int64
How do I convert the above into a time series? Additionally, how do I change the integer for the day of week into a string (i.e. 'Monday' instead of '1')? I couldn't find the answer to those questions in this forum. Thank you
Let's break down the problem into two parts.
Converting the num_delayed columns into a time series
I am not sure what you mean by a time series here, but the code below should work well for your plotting purpose.
delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] > 0]
delayed_series = delayed_flights['DAY_OF_WEEK'].value_counts().sort_index()
delayed_df = delayed_series.rename('NUM_DELAYS').to_frame()
delayed_array = delayed_df['NUM_DELAYS'].values
Note the sort_index(): value_counts returns the counts sorted by frequency, not by day, so we re-sort by the day-of-week index. delayed_array then contains the delayed-flight counts in weekday order (day 1 through day 7).
Converting the day integer into a weekday name
You can easily do this by using the calendar module.
>>> import calendar
>>> calendar.day_name[0]
'Monday'
Note that calendar.day_name is always indexed with 0 as Monday; setfirstweekday only affects the calendar layout functions, not day_name.
In your case, your day integers are 1-indexed, so you need to subtract 1 to make them 0-indexed. Another easy workaround is to define a dictionary with the day integers as keys and the weekday names as values.
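Putting the two parts together, a minimal plotting sketch (assuming the delayed_series built above; the styling choices are just an illustration):

import calendar
import matplotlib.pyplot as plt

counts = delayed_series.sort_index()                       # counts indexed 1..7
labels = [calendar.day_name[d - 1] for d in counts.index]  # 1 -> 'Monday', ...

plt.plot(labels, counts.values, marker='o')
plt.xlabel('Day of week')
plt.ylabel('Number of delayed flights')
plt.title('Delayed flights by day of week (2019)')
plt.show()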

Print the first value of a dataframe based on condition, then iterate to the next sequence

I'm looking to perform data analysis on 100 years of climatological data for select U.S. locations (8 in particular), for each day spanning the 100 years. I have a pandas DataFrame set up with columns for Max temperature, Min temperature, Avg temperature, Snowfall, and Precip Total, and then Day, Year, and Month values (I also have an index based on a datetime value). Right now, I want to set up a for loop to print the first Maximum temperature of 90 degrees F or greater from each year, but ONLY the first. Eventually I want to narrow this down to each of my 8 locations, but first I just want to get the for loop to work.
I've experimented with various iterations of a for loop.
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break
Unsurprisingly, the output of the loop I provided prints the first 90-degree day (from the year 1919, the beginning of my data frame) and breaks.
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break
1919-06-12 00:00:00
That's fine. If I take out the break statement, all of the 90-degree days print, including multiple in the same year. I just want the first value from each year to print. Do I need to set up a second for loop to increment through the years? If I explicitly state the year, as below, while trying to loop through a counter, the loop still begins in 1919 and eventually reaches an out-of-bounds index. I know this logic is incorrect.
count = 1919
while count < 2019:
    for year in range(len(climate['Year'])):
        if (climate[climate['Year'] == count]['Max'][year] >= 90).all():
            print(climate.index[year])
    count = count + 1
Any input is sincerely appreciated.
You can achieve this without having a second for loop. Assuming the climate dataframe is ordered chronologically, this should do what you want:
current_year = None
for i in range(climate.shape[0]):
    if climate['Max'][i] >= 90 and climate['Year'][i] != current_year:
        print(climate.index[i])
        current_year = climate['Year'][i]
Notice that we're using the current_year variable to keep track of the latest year that we've already printed the result for. Then, in the if check, we're checking if we've already printed a result for the year of the current row in the loop.
That's one way to do it, but I would suggest taking a look at pandas.DataFrame.groupby because I think it fits your use case well. You could get a dataframe that contains the first >=90 max days per year with the following (again assuming climate is ordered chronologically):
climate[climate.Max >= 90].groupby('Year').first()
This just filters the dataframe to only contain the >=90 max days, groups rows from the same year together, and retains only the first row from each group. If you had an additional column Location, you could extend this to get the same except per location per year:
climate[climate.Max >= 90].groupby(['Location', 'Year']).first()
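For example, on a small made-up frame (hypothetical data, just to illustrate the groupby):

import pandas as pd

climate = pd.DataFrame(
    {'Max': [88, 91, 95, 89, 92],
     'Year': [1919, 1919, 1919, 1920, 1920]},
    index=pd.to_datetime(['1919-06-10', '1919-06-12', '1919-07-01',
                          '1920-05-30', '1920-06-15']))

print(climate[climate.Max >= 90].groupby('Year').first())
# one row per year: Max 91 for 1919, Max 92 for 1920

If you also want the date of each first occurrence, calling reset_index() before the groupby keeps the datetime index around as a column.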

Moving average excluding weekends and holidays

I have a table within PowerPivot that tracks a count of customers through our sales pipeline, from first interaction to charged sale (by sales location). So far, I've created a moving 5-day average that averages each task. Below is the DAX formula I've created thus far and an example table.
=
CALCULATE (
    SUM ( [Daily Count] ),
    DATESINPERIOD ( Table1[Date], LASTDATE ( Table1[Date] ), -7, DAY ),
    ALLEXCEPT ( Table1, Table1[Sales Location], Table1[Group] )
) / 5
Where I'm struggling is coming up with a way to exclude weekends and company-observed holidays. Additionally, if a holiday falls on a weekday, I would like to remove that day from the average and go back an additional day (to smooth the trend).
For example, on 11/26/18 (the Monday after Thanksgiving and Black Friday) I would like to average the five business days previous (11/26/18, 11/21-11/19, and 11/16). In the example above, the moving total and average for the previous 5 days should be Intake = 41 (total) 8.2 (average), Appointment = 30 (total) 6 (average), and Sale = 13 (total) and 2.6 (average).
Based on the formula currently, each of these numbers is inaccurate. Is there an easy way to exclude these days?
Side note: I’ve created an ancillary table with all holidays that is related to the sales data that I have.
Thank you for the help!
For this, I'd recommend using a calendar table related to Table1 on the Date column that also has a column IsWorkday with 1 if that day is a workday and 0 otherwise.
Once you have that set up, you can write a measure like this:
Moving Avg =
VAR Last5Workdays =
    SELECTCOLUMNS (
        TOPN (
            5,
            FILTER (
                DateTable,
                DateTable[Date] <= EARLIER ( Table1[Date] )
                    && DateTable[IsWorkday] = 1
            ),
            DateTable[Date], DESC
        ),
        "Workday", DateTable[Date]
    )
RETURN
    CALCULATE (
        SUM ( Table1[Daily Count] ),
        Table1[Date] IN Last5Workdays,
        ALLEXCEPT ( Table1, Table1[Sales Location], Table1[Group] )
    ) / 5
The TOPN function here returns the top 5 rows of the DateTable where each row must be a workday that is less than or equal to the date in your current Table1 row (the EARLIER function refers to the earlier row context that defines the current row).
I then use SELECTCOLUMNS to turn this table into a list by selecting a single column (which I've named Workday). From there, it's basically your measure with the date filter changed a bit.
#alexisolson Thank you for the response here. I was actually able to figure this out over the weekend but forgot to close out the thread (sorry about that! Appreciate your help either way). But I did something fairly similar to what you mentioned above.
I created a date table (CorpCalendar) that includes only working days. Then I created an index column within the CorpCalendar table to give each row a unique number in ascending order. From there, I linked the CorpCalendar table to my SalesData table by related dates and used the LOOKUPVALUE function to bring the index value over from the CorpCalendar table to the SalesData table. In a separate column I subtracted 4 from the date index value to get an index adjustment column (for a range of five days between the actual date index and the adjustment, if that makes sense). I then added an additional LOOKUPVALUE helper column to match the adjusted date index column to the appropriate working day. Lastly, I used the following function to get the 5-day rolling average.
=
CALCULATE (
    SUM ( Combined[Daily Count] ),
    DATESBETWEEN (
        Combined[Date - Adjusted],
        Combined[Date - Adjusted (-5)],
        Combined[Date - Adjusted]
    ),
    ALLEXCEPT ( Combined, Combined[Group] )
) / 5
This is probably more convoluted than necessary; however, it got me to the answer I was looking for. Let me know if this makes sense and if you have any suggestions for future scenarios like this.
Thanks again!

Excel 2010 Dax Onhand Quantity Vs. Last Date Qty

I've spent the last 2 days trying to get this, and I really just need a few pointers. I'm using Excel 2010 with Power Pivot and calculating inventories. I am trying to get the amount sold between 2 dates. I recorded the quantity on hand if the item was in stock.
Item #   Day        Date        Qty
Black    Thursday   11/6/2014   2
Blue     Thursday   11/6/2014   3
Green    Thursday   11/6/2014   3
Black    Friday     11/7/2014   2
Green    Friday     11/7/2014   2
Black    Monday     11/10/2014  3
Blue     Monday     11/10/2014  4
Green    Monday     11/10/2014  3
Is there a way to do this in DAX? I may have to go back and calculate the differences for each record in Excel, but I'd like to avoid that if possible.
Some things that have made this hard for me:
1) I only record the inventory Mon-Fri. I am not sure this will always be the case, so I'd like to avoid a dependency on there being only weekdays.
2) When there is none in stock, I don't have a record for that day.
I've tried CALCULATE with DATEADD, and it gave me results that were nearly right, but it ended up filtering out some of the results. Really odd, but almost right.
Any help is appreciated.
Bryan, this may not totally answer your question, as there are a couple of things that aren't totally clear to me, but it should give you a start, and I'm happy to expand my answer if you provide further info.
One 'pattern' you can use involves the TOPN function: when used with n = 1, it returns the earliest or latest row from a table sorted by date, and that table can be filtered to dates earlier/later than those specified.
For this example I am using a 'disconnected' date table from which the user would select the two dates required in a slicer or report filter:
=
CALCULATE (
    SUM ( inventory[Qty] ),
    TOPN (
        1,
        FILTER ( inventory, inventory[Date] <= MAX ( dates[Date] ) ),
        inventory[Date],
        0
    )
)
In this case the TOPN returns a single-row table containing the latest date earlier than or equal to the latest date provided. The 1st argument of TOPN specifies the number of rows, the 2nd the table to use, the 3rd the column to sort on, and the 4th says to sort descending.
From here it is straightforward to adapt this for a second measure that finds the value for the latest date before or equal to the earliest date selected (i.e. swap MIN for MAX in MAX(dates[Date])).
Hope this helps.
Jacob
*prettified using daxformatter.com
