I have a database that contains all flight data for 2019. I want to plot a time series where the y-axis is the number of delayed flights ('DEP_DELAY_NEW') and the x-axis is the day of the week.
The day of the week column is an integer, i.e. 1 is Monday, 2 is Tuesday etc.
# only select delayed flights
delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] >0]
delayed_flights['DAY_OF_WEEK'].value_counts()
1 44787
7 40678
2 33145
5 29629
4 27991
3 26499
6 24847
Name: DAY_OF_WEEK, dtype: int64
How do I convert the above into a time series? Additionally, how do I change the integer for the day of the week into a string (i.e. 'Monday' instead of '1')? I couldn't find the answer to those questions in this forum. Thank you.
Let's break down the problem into two parts.
Converting the num_delayed columns into a time series
I am not sure what you meant by a time-series here. But the below code would work well for your plotting purpose.
delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] > 0]
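# value_counts() sorts by count, so sort_index() puts the days back in order 1-7
delayed_series = delayed_flights['DAY_OF_WEEK'].value_counts().sort_index()
# rename the counts and turn the Series into a one-column DataFrame
delayed_df = delayed_series.rename('NUM_DELAYS').to_frame()
delayed_array = delayed_df['NUM_DELAYS'].values
delayed_array contains the delayed flight counts ordered by day of the week, starting with Monday (1).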
Converting the day integer into a weekday name
You can easily do this by using the calendar module.
>>> import calendar
>>> calendar.day_name[0]
'Monday'
Note that calendar.day_name always starts at Monday (index 0); setfirstweekday only changes how the module lays out calendars, not this indexing.
In your case the day integers are 1-indexed, so you need to subtract 1 before indexing into calendar.day_name. Another easy workaround is to define a dictionary with the day integers as keys and the weekday names as values.
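Putting the two parts together, here is a minimal sketch. It assumes pandas is imported as pd and uses a tiny made-up DataFrame in place of df_airports_clean; the sample values are hypothetical, not the real flight data.

```python
import calendar
import pandas as pd

# Hypothetical stand-in for df_airports_clean; the values are made up.
df_airports_clean = pd.DataFrame({
    "DAY_OF_WEEK":   [1, 1, 2, 3, 7, 7, 7, 5],
    "ARR_DELAY_NEW": [12.0, 0.0, 5.0, 30.0, 8.0, 0.0, 45.0, 3.0],
})

# Keep only the delayed flights, as in the question.
delayed_flights = df_airports_clean[df_airports_clean["ARR_DELAY_NEW"] > 0]

# Count delays per day and order by day number (1..7) rather than by count.
delays_per_day = delayed_flights["DAY_OF_WEEK"].value_counts().sort_index()

# Days are 1-indexed (1 = Monday) while calendar.day_name is 0-indexed.
delays_per_day.index = [calendar.day_name[d - 1] for d in delays_per_day.index]

print(delays_per_day)
# delays_per_day.plot(kind="bar") would then draw the counts (needs matplotlib).
```

Days with no delayed flights simply do not appear in the result, so with real data you may want to reindex against all seven day names before plotting.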
I'm new to Stack Overflow and fairly fresh with Python (some 5 months, give or take), so apologies if I'm not explaining this too clearly!
I want to build up a historic trend of the average age of outstanding incidents on a daily basis.
I have two dataframes.
df1 contains incident data going back 8 years, with the two most relevant columns being "opened_at" and "resolved_at", which contain datetime values.
df2 contains a column called date with the full date range from 2012-06-13 to now.
The goal is to have df2 contain the number of outstanding incidents on each date (as of 00:00:00) and the average age of all those deemed outstanding.
I know it's possible to get all rows that exist between two dates, but I believe I want the opposite: for each date row in df2, find the rows in df1 where that date falls between opened_at and resolved_at.
(It would be helpful to have some example code containing an anonymized/randomized short extract of your data to try solutions on)
This is unlikely to be the most efficient solution, but I believe you could do:
df2["incs_open"] = 0 # Ensure the column exists
for row_num in range(df2.shape[0]):
    # an incident is open on a date if it was opened before it and resolved after it
    df2.at[row_num, "incs_open"] = sum(
        (df1["opened_at"] < df2.at[row_num, "date"]) &
        (df2.at[row_num, "date"] < df1["resolved_at"])
    )
(This assumes you haven't set an index on the data frame other than the default one)
For the second part of your question, the average age, I believe you can calculate that in the body of the loop like this:
open_incs = (df1["opened_at"] < df2.at[row_num, "date"]) & \
            (df2.at[row_num, "date"] < df1["resolved_at"])
ages_of_open_incs = df2.at[row_num, "date"] - df1.loc[open_incs, "opened_at"]
avg_age = ages_of_open_incs.mean()
You'll hit some weirdnesses about rounding and things. If an incident was opened last night at 3am, what is its age "today" -- 1 day, 0 days, 9 hours (if we take noon as the point to count from), etc. -- but I assume once you've got code that works you can adjust that to taste.
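For reference, here is a minimal end-to-end sketch of the loop on made-up data. The sample incidents are hypothetical, and treating a missing resolved_at (NaT) as "still open" is an assumption on my part rather than something stated in the question.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question.
df1 = pd.DataFrame({
    "opened_at": pd.to_datetime(
        ["2020-01-01 09:00", "2020-01-02 15:00", "2020-01-03 08:00"]),
    "resolved_at": pd.to_datetime(
        ["2020-01-03 12:00", "2020-01-05 10:00", None]),  # None -> NaT (unresolved)
})
df2 = pd.DataFrame({"date": pd.date_range("2020-01-01", "2020-01-05", freq="D")})

counts, avg_ages = [], []
for day in df2["date"]:
    # Open at 00:00 of `day`: opened before it, and either resolved later
    # or not resolved at all (assumption: NaT means still outstanding).
    is_open = (df1["opened_at"] < day) & (
        (df1["resolved_at"] > day) | df1["resolved_at"].isna()
    )
    ages = day - df1.loc[is_open, "opened_at"]
    counts.append(int(is_open.sum()))
    avg_ages.append(ages.mean())  # NaT when nothing is open on that day

df2["incs_open"] = counts
df2["avg_age"] = avg_ages
print(df2)
```

Collecting the results in lists and assigning them once avoids writing into the data frame cell by cell, but it is still one pass over df1 per date, so it will be slow for very long date ranges.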
I can create a list of datetimes between 1994 and 2020 as follows:
from datetime import datetime, timedelta
# Create datetime for plotting
start_date = datetime(1994,1,1)
start_date_yr = start_date.year
time = [start_date + timedelta(days=x) for x in range(9553)]
'time' is a list, and is useful for plotting my 'y' data as a function of time.
My 'y' data is a pandas series with dimension (9553,) containing some NaNs.
However, I want to plot my 'y' data as a function of day of the year, or month, or year. In MATLAB, I would use the function 'datevec' to get these corresponding years, months, days with same dimension (9553,) from variable 'time'.
I want to bin my 'y' data to get the annual cycle (either for each day of the year or each month), and the yearly averages (using all data corresponding to a given year).
How can I obtain a time array (datetime, year, month, day) with dimension (9553,), and how can I bin my 'y' data?
Make a list of tuples:
[(datetime1, year1, month1, day1), (datetime2, year2, month2, day2), (datetime3, year3, month3, day3) ]
mydates = []
for t in time:
    # strftime returns strings; use t.year, t.month, t.day if you need integers
    mydates.append((t, t.strftime('%Y'), t.strftime('%m'), t.strftime('%d')))
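For the binning part of the question, here is a rough sketch using only the standard library. The 'y' values are made up for illustration, and bin_mean is a hypothetical helper name, not an existing function.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta

start_date = datetime(1994, 1, 1)
time = [start_date + timedelta(days=x) for x in range(9553)]

# Hypothetical 'y' data: one value per day, with one NaN inserted as an example.
y = [float(t.month) for t in time]
y[10] = float("nan")

def bin_mean(keys, values):
    """Average `values` per key, skipping NaNs (hypothetical helper)."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        if not math.isnan(v):
            groups[k].append(v)
    return {k: sum(vs) / len(vs) for k, vs in sorted(groups.items())}

# Yearly averages: one value per calendar year.
yearly_mean = bin_mean([t.year for t in time], y)
# Annual cycle by month: 12 values, all years pooled.
monthly_cycle = bin_mean([t.month for t in time], y)
# Annual cycle by day of year: up to 366 values, all years pooled.
daily_cycle = bin_mean([t.timetuple().tm_yday for t in time], y)
```

Since 'y' is already a pandas Series, something like y.groupby(pd.DatetimeIndex(time).month).mean() should give the same monthly cycle with less code; the plain-Python version is shown only to make the binning explicit.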
I have the following data in a Pandas df:
index;Aircraft_Registration;issue;Leg_Number;Departure_Time;Departure_Date;Arrival_Time;Arrival_Date;Departure_Airport;Arrival_Airport
0;XXA;0;QQ464;01:07:00;2013-12-01;03:33:00;2013-12-01;JFK;AMS
1;XXA;0;QQQ445;06:08:00;2013-12-01;12:02:00;2013-12-01;AMS;CPT
2;XXA;0;QQQ446;13:04:00;2013-12-01;13:13:00;2013-12-01;JFK;SID
3;XXA;0;QQ446;14:17:00;2013-12-01;20:15:00;2013-12-01;SID;FRA
4;XXA;0;QQ453;02:02:00;2013-12-02;13:09:00;2013-12-02;JFK;BJL
5;XXA;0;QQ150;05:47:00;2018-12-03;12:37:00;2018-03-03;KAO;AMS
6;XXA;0;QQ457;15:09:00;2018-11-03;17:51:00;2018-03-03;AMS;AGP
7;XXA;0;QQ457;08:34:00;2018-12-03;22:47:00;2018-03-03;AGP;JFK
8;XXA;0;QQ458;03:34:00;2018-12-03;23:59:00;2018-03-03;ATL;BJL
9;XXA;0;QQ458;06:26:00;2018-10-04;07:01:00;2018-03-04;BJL;AMS
I want to group this data by month, ignoring the year, so ideally I would end up with 12 new dataframes, each representing the events of that month regardless of the year.
I tried the following:
sort = list(df.groupby(pd.Grouper(freq='M', key='Departure_Date')))
This results in a list containing a data frame for each month and year, in this case 60 entries, many of which are empty since there is no data for that month.
My expected result is a list containing 12 dataframes, one for each month (January, February etc.)
I think you need dt.month for months 1-12, or dt.strftime for January-December:
sort = list(df.groupby(df['Departure_Date'].dt.month))
Or:
sort = list(df.groupby(df['Departure_Date'].dt.strftime('%B')))
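Here is a small self-contained sketch of the dt.month variant. The rows are a shortened, hypothetical extract in the shape of the question's data, and it assumes Departure_Date still needs converting to datetime, which the .dt accessor requires.

```python
import pandas as pd

# A few rows in the shape of the question's data (shortened for illustration).
df = pd.DataFrame({
    "Leg_Number": ["QQ464", "QQ453", "QQ150", "QQ457", "QQ458"],
    "Departure_Date": ["2013-12-01", "2013-12-02", "2018-12-03",
                       "2018-11-03", "2018-10-04"],
})

# The .dt accessor only works on datetime columns, so convert first.
df["Departure_Date"] = pd.to_datetime(df["Departure_Date"])

# One sub-DataFrame per month number, regardless of the year.
by_month = {month: group
            for month, group in df.groupby(df["Departure_Date"].dt.month)}

print(sorted(by_month))   # month numbers that occur in the data
print(by_month[12])       # all December rows, from both 2013 and 2018
```

Only months that actually occur in the data get a group, so you will see fewer than 12 entries if some months have no flights.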
I have an expense file that I am trying to read in and from this file create a daily log. A small subset of the file that extends over years is shown below, for a few days in January 2015.
Date,Checking_Debit,Checking_Addition,Savings_Debit,Savings_Addition
2015-01-07,342.1,0.0,0.0,0.0
2015-01-07,981.0,0.0,0.0,0.0
2015-01-07,3185.0,0.0,0.0,0.0
2015-01-05,55.0,0.0,0.0,0.0
2015-01-05,75.0,0.0,0.0,0.0
2015-01-03,287.0,0.0,0.0,0.0
2015-01-02,64.8,0.0,0.0,0.0
2015-01-02,75.0,0.0,0.0,75.0
2015-01-02,1280.0,0.0,0.0,0.0
2015-01-02,245.0,0.0,0.0,0.0
2015-01-01,45.0,0.0,0.0,0.0
In my code I start with the variables checking_start and savings_start that contain the start values of the checking and savings account. I would like to give the code a start date and an end date and have the code iterate through each day, see if there was an expense on that day and subtract the checking and savings debits and add the checking and savings additions. If there were no expenses on that day it should keep the accounts at the same value as the previous day. In addition, I am trying to constrain myself to Pandas data frames in the implementation. So far my code looks like this.
import pandas as pd
from datetime import date
check_start = 8500.0
savings_start = 4000.0
start_date = date(2017, 1, 1)
end_date = date(2017, 1, 8)
df = pd.read_csv('file_name.csv', dtype={'Date': str, 'Checking_Debit': float,
'Checking_Addition': float,
'Savings_Debit': float,
'Savings_Addition': float})
In a Pythonic way using the Pandas module, how do I walk from the start date to the end date one day at a time, check whether there are any expenses on each date, and then subtract them from the checking and savings balances? At the end I should have an array with the value of the checking account on each date, and the same for the savings account.
The result should be arrays written into another .csv file with the following format.
Date,Checking,Savings
2017-01-07,1865.1,3925.0
2017-01-06,6373.2,3925.0
2017-01-05,6373.2,3925.0
2017-01-04,6503.2,3925.0
2017-01-03,6503.2,3925.0
2017-01-02,6790.2,3925.0
2017-01-01,8455.0,4000.0
Start by reading the data that you provided, parsing the Date column as dates while reading.
import pandas as pd
df = pd.read_csv(r"dat.csv", parse_dates=[0],dtype={'Checking_Debit': float,
'Checking_Addition': float,
'Savings_Debit': float,
'Savings_Addition': float})
Set Date as index for better data manipulation.
df = df.set_index("Date")
Initialize all the variables for the loop
check_start = 8500.0
savings_start = 4000.0
start_date = pd.to_datetime('2015/1/1')
end_date = pd.to_datetime('2015/1/8')
delta = pd.Timedelta('1 days') # time that needs to be added to start date
Now sum the expense data for each date
grp_df = df.groupby('Date').sum()
Now use a while loop to build the expense report for each day
expense_report = []
while start_date <= end_date:
    if start_date in grp_df.index:
        # apply that day's additions and debits to the running balances
        savings_start += grp_df.loc[start_date, "Savings_Addition"] - grp_df.loc[start_date, "Savings_Debit"]
        check_start += grp_df.loc[start_date, "Checking_Addition"] - grp_df.loc[start_date, "Checking_Debit"]
    # record the balances for every day, whether or not there were expenses
    expense_report.append([start_date, check_start, savings_start])
    start_date += delta
Convert the expense_report list to a pandas DataFrame
df_exp_rpt = pd.DataFrame(expense_report,columns=["Date","Checking","Savings"])
print(df_exp_rpt)
        Date  Checking  Savings
0 2015-01-01    8455.0   4000.0
1 2015-01-02    6790.2   4075.0
2 2015-01-03    6503.2   4075.0
3 2015-01-04    6503.2   4075.0
4 2015-01-05    6373.2   4075.0
5 2015-01-06    6373.2   4075.0
6 2015-01-07    1865.1   4075.0
7 2015-01-08    1865.1   4075.0
You can save it to a CSV file with
df_exp_rpt.to_csv("filename.csv")
Note: The Savings values are 4075.0 instead of 3925.0 because your original data has a value of 75 in the Savings_Addition column, which is added to the balance rather than subtracted.
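The same report can also be built without an explicit loop. This is a sketch of an alternative approach (reindex plus cumsum), not a change to the method above; it inlines the sample rows from the question and assumes every transaction falls inside the chosen date range, since rows outside it are dropped by the reindex.

```python
import io
import pandas as pd

# The sample rows from the question, inlined so the sketch is self-contained.
csv_text = """Date,Checking_Debit,Checking_Addition,Savings_Debit,Savings_Addition
2015-01-07,342.1,0.0,0.0,0.0
2015-01-07,981.0,0.0,0.0,0.0
2015-01-07,3185.0,0.0,0.0,0.0
2015-01-05,55.0,0.0,0.0,0.0
2015-01-05,75.0,0.0,0.0,0.0
2015-01-03,287.0,0.0,0.0,0.0
2015-01-02,64.8,0.0,0.0,0.0
2015-01-02,75.0,0.0,0.0,75.0
2015-01-02,1280.0,0.0,0.0,0.0
2015-01-02,245.0,0.0,0.0,0.0
2015-01-01,45.0,0.0,0.0,0.0
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

check_start, savings_start = 8500.0, 4000.0
days = pd.date_range("2015-01-01", "2015-01-08", freq="D")

# Totals per day; days without transactions become 0 after reindexing.
daily = df.groupby("Date").sum().reindex(days, fill_value=0.0)

# Running balance = start value + cumulative net change.
report = pd.DataFrame({
    "Checking": check_start
    + (daily["Checking_Addition"] - daily["Checking_Debit"]).cumsum(),
    "Savings": savings_start
    + (daily["Savings_Addition"] - daily["Savings_Debit"]).cumsum(),
})
report.index.name = "Date"
print(report.round(2))
```

Because Date is the index here, report.to_csv("filename.csv") writes it as the first column, which matches the Date,Checking,Savings layout asked for in the question.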