I have some values below which show weekly results over a period of time. On week 19, there was a new process implemented which was supposed to lower the results further.
However, it is clear that there was already a week-over-week reduction in the results before Week 19. What is the best way to quantify the impact of the 'New Process' versus the rate of improvement that was already underway before Week 19? I do not want to 'double-count' the effect of the New Process.
Week # Result Status
Week 1 849.27 NA
Week 2 807.59 NA
Week 3 803.59 NA
Week 4 849.7 NA
Week 5 852.19 NA
Week 6 845.06 NA
Week 7 833.77 NA
Week 8 788.46 NA
Week 9 800.32 NA
Week 10 814.66 NA
Week 11 829.21 NA
Week 12 799.49 NA
Week 13 812.24 NA
Week 14 772.62 NA
Week 15 782.13 NA
Week 16 779.66 NA
Week 17 752.86 NA
Week 18 758.39 NA
Week 19 738.47 New Process
Week 20 721.11 New Process
Week 21 642.04 New Process
Week 22 718.72 New Process
Week 23 743.47 New Process
Week 24 709.57 New Process
Week 25 704.48 New Process
Week 26 673.51 New Process
Trying out the example, it looks like the improvement is around 6%, but with a wide confidence interval.
A break in trend doesn't look significant.
The first models below are estimated with OLS with a shift in the constant; the first one also includes a shift in trend.
I use Poisson in the last model, since the values of the dependent variable are positive and it estimates an exponential model. The standard errors are correct if we use a robust covariance matrix. (We are using Poisson just to estimate an exponential model; we don't assume that the underlying distribution is Poisson.)
Note: It's a pure numpy version; I didn't bother using pandas or patsy formulas. Poisson has optimization problems if some of the explanatory variables are too large.
import numpy as np
import statsmodels.api as sm
data = np.array(
[ 849.27, 807.59, 803.59, 849.7 , 852.19, 845.06, 833.77,
788.46, 800.32, 814.66, 829.21, 799.49, 812.24, 772.62,
782.13, 779.66, 752.86, 758.39, 738.47, 721.11, 642.04,
718.72, 743.47, 709.57, 704.48, 673.51])
nobs = len(data)
trend = np.arange(nobs)
proc = (trend >= 18).astype(int)  # indicator for Week 19 onwards (New Process)
x = np.column_stack((np.ones(nobs), trend, proc, (trend - 18) * proc))
x_shift = x[:, :3]  # design with a shift in the constant only (common trend)

# OLS with a shift in both the constant and the trend at Week 19
res = sm.OLS(data, x).fit()
res.model.exog_names[:] = ['const', 'trend', 'const_diff', 'trend_new']
print(res.summary())

# OLS with a shift in the constant only
res2 = sm.OLS(data, x_shift).fit()
res2.model.exog_names[:] = ['const', 'trend', 'const_diff']
print(res2.summary())

# OLS on log(data), i.e. an exponential model fit by least squares
res4 = sm.OLS(np.log(data), x_shift).fit()
res4.model.exog_names[:] = ['const', 'trend', 'const_diff']
print(res4.summary())

# Poisson used only to estimate an exponential mean model; robust (HC0)
# standard errors, with a two-step optimization to avoid convergence problems
res3 = sm.Poisson(data, x_shift).fit(cov_type='HC0', method='nm', maxiter=5000)
res3 = sm.Poisson(data, x_shift).fit(start_params=res3.params, cov_type='HC0', method='bfgs')
res3.model.exog_names[:] = ['const', 'trend', 'const_diff']
print(res3.summary())
print(np.exp(res3.params))
Calculating the rate of change (i.e. the change in value per week or month) might give a good idea of whether the change is accelerating or not.
Another simple way would be to look at a "moving average". Calculate each week the average of the last X weeks. The average would be less sensitive to short lasting changes and "noise".
You may need to try a few values of X (2,3,4) to see what works better.
Plotting a graph of the data (and the moving average) might give you a clearer picture.
If you can load some data that can be downloaded or copy-pasted to Excel, I can demonstrate the above.
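Using the weekly values from the question, a minimal sketch of both ideas in pandas might look like this (the 4-week window is an arbitrary choice to experiment with, not a recommendation):
import pandas as pd
import matplotlib.pyplot as plt
results = pd.Series([849.27, 807.59, 803.59, 849.7, 852.19, 845.06, 833.77,
                     788.46, 800.32, 814.66, 829.21, 799.49, 812.24, 772.62,
                     782.13, 779.66, 752.86, 758.39, 738.47, 721.11, 642.04,
                     718.72, 743.47, 709.57, 704.48, 673.51],
                    index=range(1, 27), name='Result')
weekly_change = results.diff()                 # week-over-week rate of change
moving_avg = results.rolling(window=4).mean()  # trailing 4-week moving average
ax = results.plot(marker='o', label='Result')
moving_avg.plot(ax=ax, label='4-week moving average')
ax.axvline(19, linestyle='--', label='New Process starts')
ax.legend()
plt.show()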
I would like to create a series of simulated values by resampling from empirical observations. The data I have are a time series at 1-minute frequency. The simulations should be made for an arbitrary number of days with the same times each day. The twist is that I need to sample conditional on the time, i.e. when sampling for a time of 8:00, it should be more probable to sample a value from around 8:00 (but not limited to 8:00) in the original series.
I have made a small sketch to show how the draw distribution changes depending on which time a value is simulated for:
I.e. for T=0 it is more probable to draw a value from the actual distribution where the time of day is close to 0, and not probable to draw a value from the original distribution at a time of day of T=n/2 or later, where n is the number of unique timestamps in a day.
Here is a code snippet to generate sample data (I am aware that there is no need to sample conditional on this test data, but it is just to show the structure of the data)
import numpy as np
import pandas as pd
# Create a test data frame (only for illustration)
df = pd.DataFrame(index=pd.date_range(start='2020-01-01', end='2020-12-31', freq='T'))
df['MyValue'] = np.random.normal(0, scale=1, size=len(df))
print(df)
MyValue
2020-01-01 00:00:00 0.635688
2020-01-01 00:01:00 0.246370
2020-01-01 00:02:00 1.424229
2020-01-01 00:03:00 0.173026
2020-01-01 00:04:00 -1.122581
...
2020-12-30 23:56:00 -0.331882
2020-12-30 23:57:00 -2.463465
2020-12-30 23:58:00 -0.039647
2020-12-30 23:59:00 0.906604
2020-12-31 00:00:00 -0.912604
[525601 rows x 1 columns]
# Objective: Create a new time series, where each time the values are
# drawn conditional on the time of the day
I have not been able to find an answer here that fits my requirements. All help is appreciated.
I consider this sentence:
need to sample conditional on the time, i.e. when sampling for a time of 8:00, it should be more probable to sample a value around 8:00 (but not limited to 8:00) from the original series.
Then, assuming the standard deviation is one sixth of the day (given your drawing), you could draw the position to sample from like this:
value = np.random.normal(loc=current_time_sample, scale=total_samples/6)
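Building on that idea, here is a minimal sketch of one way to implement the conditional resampling (the function name, the 120-minute kernel width and the circular wrap-around are my own assumptions, not part of the question):
import numpy as np

def sample_conditional_on_time(df, target_time, n_draws=1, sigma_minutes=120, rng=None):
    # Weight every historical observation by a Gaussian kernel on the (circular)
    # distance between its time of day and the target time, then resample.
    rng = np.random.default_rng() if rng is None else rng
    minutes = (df.index.hour * 60 + df.index.minute).to_numpy()  # time of day in minutes
    target = target_time.hour * 60 + target_time.minute
    diff = np.abs(minutes - target)
    diff = np.minimum(diff, 1440 - diff)                         # wrap around midnight
    weights = np.exp(-0.5 * (diff / sigma_minutes) ** 2)
    weights = weights / weights.sum()
    return rng.choice(df['MyValue'].to_numpy(), size=n_draws, p=weights)

# e.g. one simulated value for 08:00:
# import datetime
# sim = sample_conditional_on_time(df, datetime.time(8, 0))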
I want to build a measure to get the simple moving average for each day. I have a cube with a single table containing stock market information.
Schema
The expected result is that for each date, this measure shows the closing price average of the X previous days to that date.
For example, for the date 2013-02-28 and for X = 5 days, this measure would show the average closing price for the days 2013-02-28, 2013-02-27, 2013-02-26, 2013-02-25, 2013-02-22. The closing values of those days are summed, and then divided by 5.
The same would be done for each of the rows.
Example dashboard
Maybe it could be achieved just with the function tt.agg.mean(), but indicating those X previous days in the scope parameter.
The problem is that I am not sure how to obtain the X previous days for each of the dates dynamically so that I can use them in a measure.
To compute a sliding average you can use the cumulative scope, as described in the atoti documentation: https://docs.atoti.io/latest/lib/atoti.scope.html#atoti.scope.cumulative.
By passing a tuple containing the date range, in your case ("-5D", None), you will be able to calculate a sliding average over the past 5 days for each date in your data.
The resulting Python code would be:
import atoti as tt
# session setup
...
m, l = cube.measures, cube.levels
# measure setup
...
tt.agg.mean(m["ClosingPrice"], scope=tt.scope.cumulative(l["date"], window=("-5D", None)))
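A short usage sketch, assuming the closing price is already exposed as a measure m["ClosingPrice"] and the date level as l["date"] (the measure name "ClosingPrice SMA 5D" is just illustrative):
# register the sliding average as its own measure, then query it per date
m["ClosingPrice SMA 5D"] = tt.agg.mean(
    m["ClosingPrice"],
    scope=tt.scope.cumulative(l["date"], window=("-5D", None)),
)
cube.query(m["ClosingPrice SMA 5D"], levels=[l["date"]])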
I have a database that contains all flight data for 2019. I want to plot a time series where the y-axis is the number of flights that are delayed ('DEP_DELAY_NEW') and the x-axis is the day of the week.
The day of the week column is an integer, i.e. 1 is Monday, 2 is Tuesday etc.
# only select delayed flights
delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] >0]
delayed_flights['DAY_OF_WEEK'].value_counts()
1 44787
7 40678
2 33145
5 29629
4 27991
3 26499
6 24847
Name: DAY_OF_WEEK, dtype: int64
How do I convert the above into a time series? Additionally, how do I change the integer for the 'day of week' into a string (i.e. 'Monday' instead of '1')? I couldn't find the answer to those questions in this forum. Thank you
Let's break down the problem into two parts.
Converting the num_delayed columns into a time series
I am not sure what you meant by a time series here, but the code below should work for your plotting purpose.
delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] > 0]
delayed_series = delayed_flights['DAY_OF_WEEK'].value_counts()
delayed_df = delayed_series.to_frame(name='NUM_DELAYS')
delayed_array = delayed_df['NUM_DELAYS'].values
delayed_array contains the delayed-flight counts. Note that value_counts sorts by count, not by day of week; use .sort_index() on the result if you want the counts ordered Monday to Sunday.
Converting the day integer into a weekday name
You can easily do this by using the calendar module.
>>> import calendar
>>> calendar.day_name[0]
'Monday'
If Monday is not the first day of week, you can use setfirstweekday to change it.
In your case, your day integers are 1-indexed, so you would need to subtract 1 to make them 0-indexed. Another easy workaround would be to define a dictionary with day integers as keys and weekday names as values.
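For illustration, a small sketch putting the two parts together (it reuses the ARR_DELAY_NEW filter from your snippet; the bar chart is just one way to plot the counts):
import calendar
import matplotlib.pyplot as plt

delayed_flights = df_airports_clean[df_airports_clean['ARR_DELAY_NEW'] > 0]
counts = delayed_flights['DAY_OF_WEEK'].value_counts().sort_index()  # order days 1..7
counts.index = [calendar.day_name[d - 1] for d in counts.index]      # 1 -> 'Monday', ...
ax = counts.plot(kind='bar')
ax.set_ylabel('Number of delayed flights')
plt.show()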
I cannot figure out the approach to this, as the principal amount changes after every year (if compounded annually, which is the easiest case). The eventual goal is to calculate the exact number of years, months and days it takes to earn, say, 150000 as interest on a deposit of 1000000 at an interest rate of, say, 6.5%. I have tried but cannot seem to figure out how to increment the year/month/day in the loop. I don't mind if this is downvoted because I have not posted any code (well, my attempts are wrong). This is not as simple as it might seem to beginners here.
It is a pure maths question. Compound interest is calculated as follows:
P_total = P_initial * (1 + rate/100)^time
where P_total is the new total; rate is usually given as a percentage, so divide by 100; time is in years. You are interested in the difference, though, so use
interest = P_initial * (1 + rate/100)^time - P_initial
instead, which is in Python:
def compound_interest(P, rate, time):
    interest = P * (1 + rate/100)**time - P
    return interest
A basic inversion of this to yield time, given P, r, and target instead, is
time = log((target + P_initial) / P_initial) / log(1 + rate/100)
and this will immediately return the number of years. Converting the fraction to days is simple (an average year has 365.25 days), but for months you'll have to approximate.
At the bottom, the result is fed back into the standard compound interest formula to show it indeed returns the expected yield.
import math

def reverse_compound_interest(P, rate, target):
    time = math.log((target + P) / P) / math.log(1 + rate/100)
    return time

timespan = reverse_compound_interest(2500000, 6.5, 400000)
print('time in years', timespan)
years = math.floor(timespan)
months = math.floor(12 * (timespan - years))
days = math.floor(365.25 * (timespan - years - months/12))
print(years, 'y', months, 'm', days, 'd')
print(compound_interest(2500000, 6.5, timespan))
will output
time in years 2.356815854829652
2 y 4 m 8 d
400000.0
Can we do better? Yes. datetime allows arbitrary offsets to be added to the current date, so assuming you start earning today (now), you can immediately get your date of $$$:
from datetime import datetime,timedelta
# ... original script here ...
timespan *= 31556926 # the number of seconds in a year
print ('time in seconds',timespan)
print (datetime.now() + timedelta(seconds=timespan))
which shows for me (your target date will differ):
time in years 2.356815854829652
time in seconds 74373863.52648607
2022-08-08 17:02:54.819492
You could do something like
def how_long_till_i_am_rich(investment, profit_goal, interest_rate):
    profit = 0
    days = 0
    daily_interest = interest_rate / 100 / 365
    while profit < profit_goal:
        days += 1
        profit += (investment + profit) * daily_interest
    years = days // 365
    months = days % 365 // 30
    days = days - (months * 30) - (years * 365)
    return years, months, days
years, months, days = how_long_till_i_am_rich(2500000, 400000, 8)
print(f"It would take {years} years, {months} months, and {days} days")
OUTPUT
It would take 1 years, 10 months, and 13 days
I have this dataframe:
date amount
2018/01 100
2018/02 105
2018/03 110.25
2018/04 200
As you can see, every month the amount increases by 5% of the previous value. However, every 4th month (2018/04), this rule does not apply; instead, the amount should just be a constant value, 200 for example.
How do I program this in pandas dataframe?
#Lroy_12374 It's not clear what would happen in months 5-8 and beyond, which would affect how to write the logic. For example:
a) Should month 5 be 5% higher than month 3? OR
b) should it be 5% higher than every fourth month (i.e. April 2018, August 2018, December 2018, April 2019, August 2019, December 2019, etc.)? OR
c) Should it be 5% higher than what month 4 would have been had it not been a constant, which means that month 5 is 1.05^2 * (month 3)?
Also, the definition of a constant is not clear. Will it literally be 200 (or some other fixed number) for every fourth month? Or will it be a different number each time that simply does not follow the pattern of the other 3 months?
I have written some code for scenario c) above:
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': ['2018/01', '2018/02', '2018/03', '2018/04',
                            '2018/05', '2018/06', '2018/07', '2018/08']})
start_amount = 100
constant = 200
growth = .05
df['amount'] = np.where((df.index + 1) % 4 != 0,
                        start_amount * (1 + growth) ** df.index, constant)
df
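For reference, with these settings the amount column should come out approximately as 100, 105, 110.25, 200, 121.55, 127.63, 134.01, 200: every row whose 1-based position is a multiple of 4 gets the constant, and the other rows continue the 5% compounding from the start amount.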
The key here is to use np.where and implement logic based on the row number, which you can get with df.index. In the code above I add 1 to the row number (df.index + 1), since Python starts counting at 0 and you want logic based on the fourth month. Then I use the % (modulo) operator, which returns the remainder after dividing; the remainder is zero exactly on every fourth row (4, 8, 12, ... divided by 4 leave remainder 0). So wherever a row is not a fourth row, the amount is the start amount multiplied by 1.05 raised to the power of the row number (a 5% increase per month), and on every fourth row the constant is returned instead.
I hope this helps.