How to build a time series with Matplotlib - python-3.x

I have a database that contains all flight data for 2019. I want to plot a time series where the y-axis is the number of flights that are delayed ('DEP_DELAY_NEW') and the x-axis is the day of the week.
The day of the week column is an integer, i.e. 1 is Monday, 2 is Tuesday etc.
# only select delayed flights
delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] >0]
delayed_flights['DAY_OF_WEEK'].value_counts()
1 44787
7 40678
2 33145
5 29629
4 27991
3 26499
6 24847
Name: DAY_OF_WEEK, dtype: int64
How do I convert the above into a time series? Additionally, how do I change the integer for the 'day of week' into a string (i.e. 'Monday' instead of '1')? I couldn't find the answer to those questions in this forum. Thank you.

Let's break down the problem into two parts.
Converting the num_delayed columns into a time series
I am not sure what you mean by a time series here, but the code below works well for your plotting purpose.
delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] > 0]
# sort_index() keeps the counts in day-of-week order (1..7) instead of by count
delayed_series = delayed_flights['DAY_OF_WEEK'].value_counts().sort_index()
delayed_df = delayed_series.rename('NUM_DELAYS').to_frame()
delayed_array = delayed_df['NUM_DELAYS'].values
delayed_array now contains the delayed flight counts in day-of-week order.
Converting the day integer into a weekday name
You can easily do this by using the calendar module.
>>> import calendar
>>> calendar.day_name[0]
'Monday'
If Monday is not the first day of the week, you can use setfirstweekday to change it.
In your case, your day integers are 1-indexed and hence you would need to subtract 1 to make it 0-indexed. Another easy workaround would be to define a dictionary with keys as day_int and values as weekday.
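Putting the two parts together, here is a minimal sketch of the plot itself, assuming the delayed_flights frame from above and matplotlib:
import calendar
import matplotlib.pyplot as plt

# Map the 1-indexed day integers to names: 1 -> 'Monday', ..., 7 -> 'Sunday'
day_names = {i + 1: calendar.day_name[i] for i in range(7)}

counts = delayed_flights['DAY_OF_WEEK'].value_counts().sort_index()
counts.index = counts.index.map(day_names)

plt.plot(counts.index, counts.values, marker='o')
plt.xlabel('Day of week')
plt.ylabel('Number of delayed flights')
plt.show()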

Related

Using pandas, how can I find out if my customer made a purchase last month or two months ago?

I'm new to Python and new to pandas. Of course, if my project used exact dates, I could easily do this, but unfortunately the date format is a little different: as you can see, the '08' comes after the year 1401, which means it is the eighth month of the year 1401.
I currently know that these 3 customers have bought from me this month, but I want to know whether these 3 customers also bought from me in the previous month or two months ago. If they did, I will give them a discount.
Of course, I should also say that the number 08 is not always fixed; it could be 09 in the next month. I just want to know whether they bought from me 1 month ago or not.
According to the picture, only Sara should get a discount now.
You could convert the purchase date to an integer and calculate the number of months from there.
For instance, say the purchase month is 1901/07 and you want to know, in 1901/08, how many months ago that purchase took place. You convert both values to integers and subtract them (190108 - 190107 = 1).
import pandas as pd
df = pd.DataFrame({'customer': ['david', 'sara'], 'date': ['1901/03', '1901/07']})
# Manually setting the reference month (190108 for Year 1901 and Month 08)
df['months'] = 190108 - df['date'].replace('/', '', regex=True).astype(int)
# Check if eligible for discount
df['discount'] = df['months'].isin([1, 2])
  customer     date  months  discount
0    david  1901/03       5     False
1     sara  1901/07       1      True
To compare with today's month you could do the following:
df['months'] = int(pd.Timestamp.now().strftime('%Y%m'))\
- df['date'].replace('/', '', regex=True).astype(int)
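One caveat with the integer subtraction: it only works within a single year (190201 - 190112 gives 89, not 1). If the data can cross a year boundary, a sketch using pandas period arithmetic (reusing the same toy frame and the 1901/08 reference month) avoids that, and periods should also handle years such as 1401 that fall outside the datetime64 range:
import pandas as pd

df = pd.DataFrame({'customer': ['david', 'sara'], 'date': ['1901/03', '1901/07']})

# Parse the 'YYYY/MM' strings into monthly periods and subtract the reference month
ref = pd.Period('1901-08', freq='M')
periods = pd.PeriodIndex(df['date'].str.replace('/', '-'), freq='M')
df['months'] = [(ref - p).n for p in periods]
df['discount'] = df['months'].isin([1, 2])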

How to build a simple moving average measure

I want to build a measure to get the simple moving average for each day. I have a cube with a single table containing stock market information.
Schema
The expected result is that for each date, this measure shows the closing price average of the X previous days to that date.
For example, for the date 2013-02-28 and for X = 5 days, this measure would show the average closing price for the days 2013-02-28, 2013-02-27, 2013-02-26, 2013-02-25, 2013-02-22. The closing values of those days are summed, and then divided by 5.
The same would be done for each of the rows.
Example dashboard
Maybe it could be achieved just with the function tt.agg.mean(), but indicating those X previous days in the scope parameter.
The problem is that I am not sure how to obtain the X previous days for each of the dates dynamically so that I can use them in a measure.
To compute a sliding average you can use the cumulative scope, as described in the atoti documentation: https://docs.atoti.io/latest/lib/atoti.scope.html#atoti.scope.cumulative.
By passing a tuple containing the date range, in your case ("-5D", None), you can calculate a sliding average over the past 5 days for each date in your data.
The resulting Python code would be:
import atoti as tt
# session setup
...
m, l = cube.measures, cube.levels
# measure setup
...
tt.agg.mean(m["ClosingPrice"], scope=tt.scope.cumulative(l["date"], window=("-5D", None)))

Finding average age of incidents in a datetime series

I'm new to Stackoverflow and fairly fresh with Python (some 5 months give or take), so apologies if I'm not explaining this too clearly!
I want to build up a historic trend of the average age of outstanding incidents on a daily basis.
I have two dataframes.
df1 contains incident data going back 8 years, with the two most relevant columns being "opened_at" and "resolved_at", which contain datetime values.
df2 contains a column called date with the full date range from 2012-06-13 to now.
The goal is to have df2 contain the number of outstanding incidents on each date (as of 00:00:00) and the average age of all those deemed outstanding.
I know it's possible to get all rows that exist between two dates, but I believe I want the opposite: for each date row in df2, find the incidents in df1 where that date falls between opened_at and resolved_at.
(It would be helpful to have some example code containing an anonymized/randomized short extract of your data to try solutions on)
This is unlikely to be the most efficient solution, but I believe you could do:
df2["incs_open"] = 0 # Ensure the column exists
for row_num in range(df2.shape[0]):
df2.at[row_num, "incs_open"] = sum(
(df1["opened_at"] < df2.at[row_num, "date"]) &
(df2.at[row_num, "date"] < df1["opened_at"])
)
(This assumes you haven't set an index on the data frame other than the default one)
For the second part of your question, the average age, I believe you can calculate that in the body of the loop like this:
open_incs = (df1["opened_at"] < df2.at[row_num, "date"]) & \
            (df2.at[row_num, "date"] < df1["resolved_at"])

ages_of_open_incs = df2.at[row_num, "date"] - df1.loc[open_incs, "opened_at"]
avg_age = ages_of_open_incs.mean()
You'll hit some weirdness around rounding: if an incident was opened last night at 3am, what is its age "today" -- 1 day, 0 days, 9 hours (if we take noon as the point to count from)? But I assume once you've got code that works, you can adjust that to taste.
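For reference, here is a self-contained sketch of the same idea on made-up data (the column names follow the question; the unresolved incident has a NaT resolved_at, and the ~(resolved_at < date) test treats it as still open):
import pandas as pd

# Hypothetical miniature versions of the two frames described in the question
df1 = pd.DataFrame({
    "opened_at":   pd.to_datetime(["2023-01-01 09:00", "2023-01-03 15:00", "2023-01-05 08:00"]),
    "resolved_at": pd.to_datetime(["2023-01-04 12:00", None, "2023-01-06 10:00"]),
})
df2 = pd.DataFrame({"date": pd.date_range("2023-01-01", "2023-01-07")})

counts, ages = [], []
for date in df2["date"]:
    # Outstanding at 00:00 on this date: opened earlier and not yet resolved
    open_incs = (df1["opened_at"] < date) & ~(df1["resolved_at"] < date)
    counts.append(open_incs.sum())
    ages.append((date - df1.loc[open_incs, "opened_at"]).mean())

df2["incs_open"] = counts
df2["avg_age"] = ages
print(df2)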

Print the first value of a dataframe based on condition, then iterate to the next sequence

I'm looking to perform data analysis on 100 years of climatological data for select U.S. locations (8 in particular), for each day spanning the 100 years. I have a pandas DataFrame set up with columns for Max temperature, Min temperature, Avg temperature, Snowfall, Precip Total, and then Day, Year, and Month values (then, I have an index also based on a date-time value). Right now, I want to set up a for loop to print the first Maximum temperature of 90 degrees F or greater from each year, but ONLY the first. Eventually, I want to narrow this down to each of my 8 locations, but first I just want to get the for loop to work.
Experimented with various iterations of a for loop.
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break
Unsurprisingly, the output of the loop I provided prints the first 90 degree day period (from the year 1919, the beginning of my data frame) and breaks.
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break

1919-06-12 00:00:00
That's fine. If I take out the break statement, all of the 90 degree days print, including multiple in the same year. I just want the first value from each year to print. Do I need to set up a second for loop to increment through the year? If I explicitly state the year, as below, while trying to loop through a counter, the loop still begins in 1919 and eventually reaches an out-of-bounds index. I know this logic is incorrect.
count = 1919
while count < 2019:
    for year in range(len(climate['Year'])):
        if (climate[climate['Year'] == count]['Max'][year] >= 90).all():
            print(climate.index[year])
    count = count + 1
Any input is sincerely appreciated.
You can achieve this without having a second for loop. Assuming the climate dataframe is ordered chronologically, this should do what you want:
current_year = None
for i in range(climate.shape[0]):
    if climate['Max'][i] >= 90 and climate['Year'][i] != current_year:
        print(climate.index[i])
        current_year = climate['Year'][i]
Notice that we're using the current_year variable to keep track of the latest year that we've already printed the result for. Then, in the if check, we're checking if we've already printed a result for the year of the current row in the loop.
That's one way to do it, but I would suggest taking a look at pandas.DataFrame.groupby because I think it fits your use case well. You could get a dataframe that contains the first >=90 max days per year with the following (again assuming climate is ordered chronologically):
climate[climate.Max >= 90].groupby('Year').first()
This just filters the dataframe to only contain the >=90 max days, groups rows from the same year together, and retains only the first row from each group. If you had an additional column Location, you could extend this to get the same except per location per year:
climate[climate.Max >= 90].groupby(['Location', 'Year']).first()
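As a quick, self-contained check of the groupby approach on made-up data (with one tweak: calling reset_index() first so the date of each year's first hot day is kept as a column rather than dropped with the index):
import pandas as pd

# Hypothetical miniature climate frame, indexed by date
climate = pd.DataFrame(
    {"Max": [88, 91, 95, 89, 92], "Year": [1919, 1919, 1919, 1920, 1920]},
    index=pd.to_datetime(["1919-06-11", "1919-06-12", "1919-07-01", "1920-05-30", "1920-06-15"]),
)

# One row per year: the date and Max of that year's first >= 90 day
first_hot = climate[climate.Max >= 90].reset_index().groupby("Year").first()
print(first_hot)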

In DAX, can I create a function that calculates the working days between two date/time fields?

There are already plenty of options for calculating the number of workdays between two dates if there are no times involved, but if you leave it as date/time, is there any way to get a number of working days (with decimal remainders) between two points in time in DAX (e.g. Power Query/Power BI)?
Assuming that your start and end times occur on working days, then you should be able to take the time difference between two dates and subtract out the number of non-working days during that period.
You'll want a calendar table to help out. Say,
Dates = CALENDARAUTO()
Then your working days measure might look like this:
WorkingDays =
    StartFinish[Finish Date] - StartFinish[Start Date] -
    SUMX(
        FILTER(
            Dates,
            Dates[Date] > StartFinish[Start Date] &&
            Dates[Date] < StartFinish[Finish Date]
        ),
        IF(WEEKDAY(Dates[Date]) = 1 || WEEKDAY(Dates[Date]) = 7, 1, 0)
    )
If you have an IsWorkDay column in your calendar table (which might include holidays as well as weekends), you can just reference that for the last line instead:
IF(Dates[IsWorkDay], 0, 1)
Note that this approach assumes a working day is 24 hours rather than a more standard 8 hours. You'll have to make some adjustments if you don't want the fractional part to indicate the portion of a 24-hour day. To express it as a portion of 8-hour work days instead, just multiply the fractional part by 24/8 = 3 (for example, a fractional part of 0.25, i.e. 6 hours, becomes 0.75 of an 8-hour day).
