Pandas, groupby/Grouper on month ignoring the year - python-3.x

I have the following data in a Pandas df:
index;Aircraft_Registration;issue;Leg_Number;Departure_Time;Departure_Date;Arrival_Time;Arrival_Date;Departure_Airport;Arrival_Airport
0;XXA;0;QQ464;01:07:00;2013-12-01;03:33:00;2013-12-01;JFK;AMS
1;XXA;0;QQQ445;06:08:00;2013-12-01;12:02:00;2013-12-01;AMS;CPT
2;XXA;0;QQQ446;13:04:00;2013-12-01;13:13:00;2013-12-01;JFK;SID
3;XXA;0;QQ446;14:17:00;2013-12-01;20:15:00;2013-12-01;SID;FRA
4;XXA;0;QQ453;02:02:00;2013-12-02;13:09:00;2013-12-02;JFK;BJL
5;XXA;0;QQ150;05:47:00;2018-12-03;12:37:00;2018-03-03;KAO;AMS
6;XXA;0;QQ457;15:09:00;2018-11-03;17:51:00;2018-03-03;AMS;AGP
7;XXA;0;QQ457;08:34:00;2018-12-03;22:47:00;2018-03-03;AGP;JFK
8;XXA;0;QQ458;03:34:00;2018-12-03;23:59:00;2018-03-03;ATL;BJL
9;XXA;0;QQ458;06:26:00;2018-10-04;07:01:00;2018-03-04;BJL;AMS
I want to group this data on the month ignoring the year so ideally would end up with 12 new dataframes each representing the events of that months ignoring the year.
I tried the following:
sort = list(df.groupby(pd.Grouper(freq='M', key='Departure_Date')))
This results in a list containing a data frame for each month and year, in this case yielding 60 lists of which many are empty since there is no data for that month.
My expected result is a list containing 12 dataframes, one for each month (January, Februari etc.)

I think need dt.month for 1-12 months or dt.strftime for January-December:
sort = list(df.groupby(df['Departure_Date'].dt.month))
Or:
sort = list(df.groupby(df['Departure_Date'].dt.strftime('%B')))

Related

How to build a time series with Matplotlib

i have a database that contains all flights data for 2019. I want to plot a time series where the y-axis is the number of flights that are delayed ('DEP_DELAY_NEW')and x-axis is the day of the week.
The day of the week column is an integer, i.e. 1 is Monday, 2 is Tuesday etc.
`# only select delayed flights`
delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] >0]
delayed_flights['DAY_OF_WEEK'].value_counts()
1 44787
7 40678
2 33145
5 29629
4 27991
3 26499
6 24847
Name: DAY_OF_WEEK, dtype: int64
How do i convert the above into a time series? Additionally how do i change the integer for the 'day of week' into a string (i.e. 'Monday instead of '1'). i couldn't find the answer to those questions in this forum. Thank you
Let's break down the problem into two parts.
Converting the num_delayed columns into a time series
I am not sure what you meant by a time-series here. But the below code would work well for your plotting purpose.
delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] > 0]
delayed_series = delayed_flights['DAY_OF_WEEK'].value_counts()
delayed_df = pd.DataFrame(delayed_series, columns=['NUM_DELAYS'])
delayed_array = delayed_df['NUM_DELAYS'].values
delayed_array contains the array of delayed flight counts in order.
Converting the day in int into a weekday
You can easily do this by using the calendar module.
>>> import calendar
>>> calendar.day_name[0]
'Monday'
If Monday is not the first day of week, you can use setfirstweekday to change it.
In your case, your day integers are 1-indexed and hence you would need to subtract 1 to make it 0-indexed. Another easy workaround would be to define a dictionary with keys as day_int and values as weekday.

Finding average age of incidents in a datetime series

I'm new to Stackoverflow and fairly fresh with Python (some 5 months give or take), so apologies if I'm not explaining this too clearly!
I want to build up a historic trend of the average age of outstanding incidents on a daily basis.
I have two dataframes.
df1 contains incident data going back 8 years, with the two most relevant columns being "opened_at" and "resolved_at" which contains datetime values.
df2 contains a column called date with the full date range from 2012-06-13 to now.
The goal is to have df2 contain the number of outstanding incidents on each date (as of 00:00:00) and the average age of all those deemed outstanding.
I know it's possible to get all rows that exist between two dates, but I believe I want the opposite and find where each date row in df2 exists between dates in opened_at and resolved_at in df1
(It would be helpful to have some example code containing an anonymized/randomized short extract of your data to try solutions on)
This is unlikely to be the most efficient solution, but I believe you could do:
df2["incs_open"] = 0 # Ensure the column exists
for row_num in range(df2.shape[0]):
df2.at[row_num, "incs_open"] = sum(
(df1["opened_at"] < df2.at[row_num, "date"]) &
(df2.at[row_num, "date"] < df1["opened_at"])
)
(This assumes you haven't set an index on the data frame other than the default one)
For the second part of your question, the average age, I believe you can calculate that in the body of the loop like this:
open_incs = (df1["opened_at"] < df2.at[row_num, "date"]) & \
(df2.at[row_num, "date"] < df1["opened_at"])
ages_of_open_incs = df2.at[row_num, "date"] - df1.loc[open_incs, "opened_at"]
avg_age = ages_of_open_incs.mean()
You'll hit some weirdnesses about rounding and things. If an incident was opened last night at 3am, what is its age "today" -- 1 day, 0 days, 9 hours (if we take noon as the point to count from), etc. -- but I assume once you've got code that works you can adjust that to taste.

Going from monthly average dataframe to an interpolated daily timeseries

I am interested in taking average monthly values, for each month, and set the monthly average values to be the value on the 15th day of each month (within a daily timeseries).
I start with the following (these are the monthly average values I am given):
m_avg = pd.DataFrame({'Month': ['1.527013956', '1.899169054', '1.669356146','1.44920871', '1.188557788', '1.017035727', '0.950243755', '1.022453993', '1.203913739', '1.369545041','1.441827406','1.48621651']
EDIT: I added one more value to the dataframe so that there are now 12 values.
Next, I want to put each of these monthly values on the 15th day (within each month) for the following time period:
ts = pd.date_range(start='1/1/1950', end='12/31/1999', freq='D')
I know how to pull out the date on 15th day of an already existing daily timeseries by using:
df= df.loc[(df.index.day==15)] # Where df is any daily timeseries
Lastly, I know how to interpolate the values once I have the average monthly values on the 15th day of each month, using:
df.loc[:, ['Col1']] = df.loc[:, ['Col1']].interpolate(method='linear', limit_direction='both', limit=100)
How do I get from the monthly DataFrame to an interpolated daily DataFrame, where I linearly interpolate between the 15th day of each month, which is the monthly value of my original DataFrame by construction?
EDIT:
Your suggestion to use np.tile() was good, but I ended up needing to do this for multiple columns. Instead of np.tile, I used:
index = pd.date_range(start='1/1/1950', end='12/31/1999', freq='MS')
m_avg = pd.concat([month]*49,axis=0).set_index(index)
There may be a better solution out there, but this is working for my needs so far.
Here is one way to do it:
import pandas as pd
import numpy as np
# monthly averages, note these should be cast to float
month = np.array(['1.527013956', '1.899169054', '1.669356146',
'1.44920871', '1.188557788', '1.017035727',
'0.950243755', '1.022453993', '1.203913739',
'1.369545041', '1.441827406', '1.48621651'], dtype='float')
# expand this to 51 years, with the same monthly averages repeating each year
# (obviously not very efficient, probably there are better ways to attack the problem,
# but this was the question)
month = np.tile(month, 51)
# create DataFrame with these values
m_avg = pd.DataFrame({'Month': month})
# set the date index to the desired time period
m_avg.index = pd.date_range(start='1/1/1950', end='12/1/2000', freq='MS')
# shift the index by 14 days to get the 15th of each month
m_avg = m_avg.tshift(14, freq='D')
# expand the index to daily frequency
daily = m_avg.asfreq(freq='D')
# interpolate (linearly) the missing values
daily = daily.interpolate()
# show result
display(daily)
Output:
Month
1950-01-15 1.527014
1950-01-16 1.539019
1950-01-17 1.551024
1950-01-18 1.563029
1950-01-19 1.575034
... ...
2000-12-11 1.480298
2000-12-12 1.481778
2000-12-13 1.483257
2000-12-14 1.484737
2000-12-15 1.486217
18598 rows × 1 columns

Print the first value of a dataframe based on condition, then iterate to the next sequence

I'm looking to perform data analysis on 100-years of climatological data for select U.S. locations (8 in particular), for each day spanning the 100-years. I have a pandas dataFrame set up with columns for Max temperature, Min temperature, Avg temperature, Snowfall, Precip Total, and then Day, Year, and Month values (then, I have an index also based on a date-time value). Right now, I want to set up a for loop to print the first Maximum temperature of 90 degrees F or greater from each year, but ONLY the first. Eventually, I want to narrow this down to each of my 8 locations, but first I just want to get the for loop to work.
Experimented with various iterations of a for loop.
for year in range(len(climate['Year'])):
if (climate['Max'][year] >=90).all():
print (climate.index[year])
break
Unsurprisingly, the output of the loop I provided prints the first 90 degree day period (from the year 1919, the beginning of my data frame) and breaks.
for year in range(len(climate['Year'])):
if (climate['Max'][year] >=90).all():
print (climate.index[year])
break
1919-06-12 00:00:00
That's fine. If I take out the break statement, all of the 90 degree days print, including multiple in the same year. I just want the first value from each year to print. Do I need to set up a second for loop to increment through the year? If I explicitly state the year, ala below, while trying to loop through a counter, the loop still begins in 1919 and eventually reaches an out of bounds index. I know this logic is incorrect.
count = 1919
while count < 2019:
for year in range(len(climate['Year'])):
if (climate[climate['Year']==count]['Max'][year] >=90).all():
print (climate.index[year])
count = count+1
Any input is sincerely appreciated.
You can achieve this without having a second for loop. Assuming the climate dataframe is ordered chronologically, this should do what you want:
current_year = None
for i in range(climate.shape[0]):
if climate['Max'][i] >= 90 and climate['Year'][i] != current_year:
print(climate.index[i])
current_year = climate['Year'][i]
Notice that we're using the current_year variable to keep track of the latest year that we've already printed the result for. Then, in the if check, we're checking if we've already printed a result for the year of the current row in the loop.
That's one way to do it, but I would suggest taking a look at pandas.DataFrame.groupby because I think it fits your use case well. You could get a dataframe that contains the first >=90 max days per year with the following (again assuming climate is ordered chronologically):
climate[climate.Max >= 90].groupby('Year').first()
This just filters the dataframe to only contain the >=90 max days, groups rows from the same year together, and retains only the first row from each group. If you had an additional column Location, you could extend this to get the same except per location per year:
climate[climate.Max >= 90].groupby(['Location', 'Year']).first()

Iterate through CSV and match lines with a specific date

I am parsing a CSV file into a list. Each list item will have a column list[3] which contains a date in the format: mm/dd/yyyy
I need to iterate through the file and extract only the rows which contain a specific date range.
for example, I want to extract all rows for the month of 12/2015. I am having trouble determining how to match the date. Any nudging in the right direction would be helpful.
Thanks.
Method1:
splits your column to month, day and year, converts month and year to integers and then compare and match 12/2015
column3 = "12/31/2015"
month, day, year = column3.split("/")
if int(month) == 12 and int(year) == 2015:
# do your thing
Method2:
parses a datetime string to time object and gets the attributes tm_year and tm_mon, compare them with corresponding month and year.
>>> import time
>>> to = time.strptime("12/03/2015", "%m/%d/%Y")
>>> to.tm_mon
12
>>> to.tm_year
2015

Resources