I am new to working with Python and to data science in general.
My time series consists of 208 weekly observations, spanning from 2019 to the end of 2022.
I have made the following transformations to my data in order to make it stationary.
Firstly, I have standardized the ts by doing the following:
avg, dev = df.mean(), df.std()
df1 = (df - avg) / dev
Next, I have taken the second order difference of the ts.
df1_diff = df1.diff()
df1_diff2 = df1_diff.diff()
Then I noticed that my data shows heteroscedasticity, which can be seen in the plot below.
I have taken the annual standard deviation of the differenced series and divided each observation by the value for its year.
yearly_volatility = df1_diff2.groupby(df1_diff2.index.year).std()
df_yearly_vol = df1_diff2.index.map(lambda d: yearly_volatility.loc[d.year])
yearly_volatility contains:
date
2019    1.250138
2020    1.518819
2021    2.006581
2022    2.559896
df_diff_22 = df1_diff2 / df_yearly_vol
I fed the transformed data into a SARIMAX model, split it into a training and a test set, and obtained a very good fit. Then I tried to forecast the next 14 weeks.
The only thing I am struggling with is how to transform those forecasted values back to the original scale.
I tried searching YouTube and Google for practical examples, but nothing I found was useful.
Do you have any suggestions on how to proceed, given the transformations described above?
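For reference, the transformations described above are all invertible if applied in reverse order. Below is a minimal sketch, not the poster's actual code: it reuses df1, avg, dev and yearly_volatility from the snippets above, and fc is assumed to be the 14-step forecast produced on the transformed (scaled, twice-differenced) scale.

fc_diff2 = fc * yearly_volatility.iloc[-1]    # undo the volatility scaling; future
                                              # volatility is unknown, so the last
                                              # observed year's value is reused here

last_diff1 = df1.diff().iloc[-1]              # last observed first difference
last_level = df1.iloc[-1]                     # last observed standardized level

fc_diff1 = last_diff1 + fc_diff2.cumsum()     # undo the second difference
fc_level = last_level + fc_diff1.cumsum()     # undo the first difference

fc_original = fc_level * dev + avg            # undo the standardization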
Related
I have prepared a time series model using FB Prophet for making forecasts. The model forecasts the coming 30 days, and my data ranges from Jan 2019 until Mar 2020 (both months inclusive), with all dates filled in. The model has been built specifically for the UK market.
I have already taken care of the following:
Seasonality
Holiday effect
My question is: how do I account for the current COVID-19 situation in the same model? The values I am trying to forecast also depend on the previous data, at least from Jan 2020 onward. So in order to forecast, I need to take the current coronavirus situation into account as well, since it would impact my forecasts beyond seasonality and the holiday effect.
How should I achieve this?
Haven't tried it, but one thing I have in mind is to add COVID-19 as a one-time holiday. You can see an example of adding a one-time promo as a pseudo-holiday at the following link: https://towardsdatascience.com/forecasting-in-python-with-facebook-prophet-29810eb57e66
I assume the same approach can be applied to a negative effect caused by COVID-19.
I have had the same issue with COVID at my work with sales forecasting. The easy solution for me was to add an extra regressor indicating the COVID period and use it in my model. Then the future is not affected by COVID unless I tell the model that it should be.
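A rough sketch of that extra-regressor idea might look like the following; the dataframe df (with Prophet's usual 'ds' and 'y' columns), the column name 'covid', and the covid_start date are all assumptions for illustration, not the answerer's actual code.

import pandas as pd
from prophet import Prophet  # older installs import Prophet from fbprophet instead

covid_start = pd.Timestamp("2020-03-01")             # assumed start of the COVID period
df["covid"] = (df["ds"] >= covid_start).astype(int)  # 1 during COVID, 0 before

m = Prophet()
m.add_regressor("covid")   # register the regressor before fitting
m.fit(df)

# The future frame must contain the regressor too; set it to 0 or 1 depending on
# whether you want the forecast to reflect the COVID effect.
future = m.make_future_dataframe(periods=30)
future["covid"] = (future["ds"] >= covid_start).astype(int)
forecast = m.predict(future)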
I am stuck on a statistics assignment and would really appreciate some qualified help.
We have been given a data set and are asked to find the 10% with the lowest rate of profit, in order to decide what profit rate is the maximum for being considered for a program.
The data has:
Mean = 3.61
St. dev. = 8.38
I am thinking that I need to find the 10th percentile, and if I run the percentile function in Excel it returns -4.71.
However, I also tried to run the numbers by hand using the z-score, where z = -1.28 for the 10th percentile:
z = (x - μ) / σ
Solving for x:
x = μ + zσ
x = 3.61 + (-1.28 × 8.38) = -7.116
My question is: which of the two methods is the right one, if either?
I am thoroughly confused at this point and hope someone has the time to help.
Thank you
This is the assignment btw:
"The Danish government introduces a program for economic growth and will
help the 10 percent of the rms with the lowest rate of prot. What rate
of prot is the maximum in order to be considered for the program given
the mean and standard deviation found above and assuming that the data
is normally distributed?"
The Excel formula gives the actual, empirical 10th percentile of your sample.
If the data you have includes all possible instances of whatever you’re trying to measure, then go ahead and use that.
If you’re sampling from a population and your sample size is small, use a t distribution or increase your sample size. If your sample size is healthy and your data are normally distributed, use z scores.
The short story is that the different outcomes suggest the data you've supplied are not normally distributed.
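To see the two approaches side by side in code, here is a small sketch; the assignment's actual data isn't available, so a placeholder sample is used, and only the norm.ppf line reproduces the hand calculation above.

import numpy as np
from scipy import stats

# Placeholder for the assignment's profit-rate data
profit = np.random.normal(loc=3.61, scale=8.38, size=500)

empirical_p10 = np.percentile(profit, 10)                 # what Excel's percentile function returns
normal_p10 = stats.norm.ppf(0.10, loc=3.61, scale=8.38)   # the z-score / normal-theory answer, about -7.13

print(empirical_p10, normal_p10)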
The following links were investigated but didn't provide the answer I was looking for or fix my problem: First, Second.
Due to confidentiality issues I cannot post the actual decomposition. I can show my current code and give the length of the data set; if this isn't enough I will remove the question.
import numpy as np
from statsmodels.tsa import seasonal

def stl_decomposition(data):
    # Flatten the nested list structure into a single 1-D series
    data = np.array(data)
    data = [item for sublist in data for item in sublist]

    # Additive decomposition with a 12-month seasonal period
    decomposed = seasonal.seasonal_decompose(x=data, freq=12)

    seas = decomposed.seasonal
    trend = decomposed.trend
    res = decomposed.resid
    return seas, trend, res
In a plot it decomposes correctly according to an additive model. However, the trend and residual series have NaN values for the first and last 6 months. The current data set has size 10*12; ideally this should also work for something as small as only 2 years.
Is this still too small, as stated in the first link? I.e., do I need to extrapolate the extra points myself?
EDIT: It seems that half of the frequency is always NaN on both ends of the trend and residual. The same still holds when decreasing the size of the data set.
According to this Github link, another user had a similar question and 'fixed' the issue: to avoid NaNs, an extra parameter can be passed.
decomposed = seasonal.seasonal_decompose(x=data, freq=12, extrapolate_trend='freq')
It will then use linear least squares to best approximate the values. (Source)
Obviously the information was literally in their documentation and clearly explained, but I completely missed/misinterpreted it. Hence I am answering my own question, to save anyone with the same issue the adventure I had.
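As a quick sanity check, here is a minimal example on synthetic data (two years of monthly values, invented purely for illustration) showing that the trend and residual no longer contain NaNs once extrapolate_trend is set:

import numpy as np
import pandas as pd
from statsmodels.tsa import seasonal

# Two years of synthetic monthly data with a 12-month cycle (illustrative only)
idx = pd.date_range("2019-01-01", periods=24, freq="MS")
values = np.arange(24) + 5 * np.sin(2 * np.pi * np.arange(24) / 12)
ts = pd.Series(values, index=idx)

# Newer statsmodels releases name this argument `period` instead of `freq`
decomposed = seasonal.seasonal_decompose(ts, freq=12, extrapolate_trend="freq")
print(decomposed.trend.isna().sum(), decomposed.resid.isna().sum())  # both print 0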
According to the parameter definition below, setting extrapolate_trend to anything other than 0 makes the trend estimation at the ends fall back to a different estimation method. I ran into this issue when I had only a few observations to estimate from.
extrapolate_trend : int or 'freq', optional
If set to > 0, the trend resulting from the convolution is
linear least-squares extrapolated on both ends (or the single one
if two_sided is False) considering this many (+1) closest points.
If set to 'freq', use `freq` closest points. Setting this parameter
results in no NaN values in trend or resid components.
I've got files with irradiance data measured every minute, 24 hours a day.
So if there is a day without any clouds in the sky, the data shows a nice continuous bell curve.
When looking for a day without any clouds in the data, I have always plotted month after month with gnuplot and checked for nice bell curves.
I was wondering if there's a Python way to check whether the irradiance measurements form a continuous bell curve.
I don't know if the question is too vague, but I'm simply looking for some ideas on that quest :-)
For a normal distribution, there are normality tests.
In short, we abuse some knowledge we have of what normal distributions look like to identify them.
The kurtosis of any normal distribution is 3. Compute the kurtosis of your data and it should be close to 3.
The skewness of a normal distribution is zero, so your data should have a skewness close to zero.
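As an illustration of these moment checks, using synthetic stand-in data since the actual irradiance files aren't available:

import numpy as np
from scipy import stats

# Stand-in data; replace with one day's irradiance measurements
measurements = np.random.normal(loc=500, scale=120, size=1440)  # one value per minute

print(stats.kurtosis(measurements, fisher=False))  # close to 3 for normal data
print(stats.skew(measurements))                    # close to 0 for normal data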
More generally, you could compute a reference distribution and use a statistical divergence to assess the difference between the distributions: bin your data, create a histogram, and start with the Jensen-Shannon divergence.
With the divergence approach, you can compare to an arbitrary distribution. You might record a thousand sunny days and check if the divergence between the sunny day and your measured day is below some threshold.
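A rough sketch of that comparison, assuming reference_day and measured_day hold per-minute irradiance for a known clear day and the day under test (here both are placeholder arrays); note that SciPy's jensenshannon returns the Jensen-Shannon distance, the square root of the divergence:

import numpy as np
from scipy.spatial.distance import jensenshannon

# Placeholder data standing in for a known clear day and the day to check
reference_day = np.random.normal(500, 120, 1440)
measured_day = np.random.normal(480, 150, 1440)

# Bin both days onto a common grid and compare the histograms
bins = np.linspace(min(reference_day.min(), measured_day.min()),
                   max(reference_day.max(), measured_day.max()), 50)
p, _ = np.histogram(reference_day, bins=bins, density=True)
q, _ = np.histogram(measured_day, bins=bins, density=True)

print(jensenshannon(p, q))  # 0 means identical histograms; pick a threshold that suits your data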
Just to complement the given answer with a code example: one can use a Kolmogorov-Smirnov test to obtain a measure of the "distance" between two distributions. SciPy offers a neat interface for this, called kstest:
from scipy import stats
import numpy as np
data = np.random.normal(size=100) # Our (synthetic) dataset
D, p = stats.kstest(data, "norm") # Perform a (two-sided by default) Kolmogorov-Smirnov test against a standard normal
In the above example, D denotes the distance between our data and a Gaussian normal (norm) distribution (smaller is better), and p denotes the corresponding p-value. Other distributions can be similarly tested by substituting norm with those implemented in scipy.stats.
Hey, I am trying to run a cosinor analysis in Statistica but am at a loss as to how to do so. I need to calculate the MESOR, AMPLITUDE, and ACROPHASE of circadian rhythm data.
http://www.wepapers.com/Papers/73565/Cosinor_analysis_of_accident_risk_using__SPSS%27s_regression_procedures.ppt
That link shows how to do it, with the formulas and such, but it has not given me much help. Does anyone know the code for it, either in Statistica or SPSS?
I really need to get this done because it is for an important paper.
I don't have SPSS or Statistica, so I can't tell you the exact "push-this-button" kind of steps, but perhaps this will help.
Cosinor analysis is fitting a cosine (or sine) curve with a known period. The main idea is that the non-linear problem of fitting a cosine function can be reduced to a problem that is linear in its parameters if the period is known. I will assume that your period is T = 24 hours.
You should already have two variables: Time at which the measurement is taken, and Value of the measurement (these, of course, might be called something else).
Now create two new variables: SinTime = sin(2 × pi × Time / 24) and CosTime = cos(2 × pi × Time / 24); this is described on p.11 of the presentation you linked. Use pi = 3.1415 if the exact value is not built in.
Run multiple linear regression with Value as outcome and SinTime and CosTime as two predictors. You should get estimates of their coefficients, which we will call A and B.
The intercept term of the regression model is the MESOR.
The AMPLITUDE is sqrt(A^2 + B^2) [square root of A squared plus B squared]
The ACROPHASE is arctan(- B / A), where arctan is the inverse function of tan. The last two formulas are from p.14 of the presentation.
The regression model should also give you an R-squared value to see how well the 24 hour circadian pattern fits the data, and an overall p-value that tests for the presence of a circadian component with period 24 hrs.
One can get standard errors on the amplitude and phase using standard error-propagation formulas, but that is not included in the presentation.
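Although the question asks about Statistica/SPSS, the regression described above is easy to sketch in Python; the synthetic data, variable names, and 24-hour period below are assumptions chosen purely for illustration.

import numpy as np
import statsmodels.api as sm

# Synthetic circadian data for illustration: mesor 10, amplitude 3, hourly samples over 3 days
rng = np.random.default_rng(0)
time_hours = np.arange(0, 72, 1.0)
value = 10 + 3 * np.cos(2 * np.pi * (time_hours - 8) / 24) + rng.normal(0, 1, time_hours.size)

# Build the two predictors described on p.11 of the presentation
sin_time = np.sin(2 * np.pi * time_hours / 24)
cos_time = np.cos(2 * np.pi * time_hours / 24)
X = sm.add_constant(np.column_stack([sin_time, cos_time]))

fit = sm.OLS(value, X).fit()
mesor, A, B = fit.params          # intercept, SinTime coefficient, CosTime coefficient

amplitude = np.sqrt(A**2 + B**2)  # sqrt(A^2 + B^2)
acrophase = np.arctan(-B / A)     # arctan(-B/A) as on p.14; sign/quadrant conventions vary between sources
print(mesor, amplitude, acrophase, fit.rsquared)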