Statistical cosinor analysis - statistics

Hey, I am trying to run a cosinor analysis in Statistica but am at a loss as to how to do so. I need to calculate the MESOR, AMPLITUDE, and ACROPHASE of circadian rhythm data.
http://www.wepapers.com/Papers/73565/Cosinor_analysis_of_accident_risk_using__SPSS%27s_regression_procedures.ppt
That link shows how to do it, formulas and all, but it has not given me much help. Does anyone know the code for it, either in Statistica or SPSS?
I really need to get this done because it is for an important paper.

I don't have SPSS or Statistica, so I can't tell you the exact "push-this-button" kind of steps, but perhaps this will help.
Cosinor analysis is fitting a cosine (or sine) curve with a known period. The main idea is that the non-linear problem of fitting a cosine function can be reduced to a problem that is linear in its parameters if the period is known. I will assume that your period T=24 hours.
You should already have two variables: Time at which the measurement is taken, and Value of the measurement (these, of course, might be called something else).
Now create two new variables: SinTime = sin(2*pi*Time/24) and CosTime = cos(2*pi*Time/24); this is described on p.11 of the presentation you linked. Use pi = 3.1415 if the exact value is not built in.
Run multiple linear regression with Value as outcome and SinTime and CosTime as two predictors. You should get estimates of their coefficients, which we will call A and B.
The intercept term of the regression model is the MESOR.
The AMPLITUDE is sqrt(A^2 + B^2) [square root of A squared plus B squared]
The ACROPHASE is arctan(-B / A), where arctan is the inverse tangent function. The last two formulas are from p.14 of the presentation.
The regression model should also give you an R-squared value to see how well the 24 hour circadian pattern fits the data, and an overall p-value that tests for the presence of a circadian component with period 24 hrs.
One can get standard errors on the amplitude and acrophase using standard error-propagation formulas, but that is not covered in the presentation.
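If you (or anyone reading this) end up doing it in R instead, a minimal sketch of the same steps could look like this. The data frame and column names (dat, Time, Value) are placeholders, and the acrophase line follows the parameterisation Value = MESOR + Amplitude * cos(2*pi*Time/24 + acrophase), which may differ from the presentation's sign convention:
# Minimal cosinor sketch; `dat` with columns Time (hours) and Value is an assumed layout,
# and the period is assumed to be 24 hours.
period <- 24
dat$SinTime <- sin(2 * pi * dat$Time / period)
dat$CosTime <- cos(2 * pi * dat$Time / period)

fit <- lm(Value ~ SinTime + CosTime, data = dat)
cf  <- coef(fit)

mesor <- unname(cf["(Intercept)"])
A     <- unname(cf["SinTime"])   # sine coefficient
B     <- unname(cf["CosTime"])   # cosine coefficient

amplitude <- sqrt(A^2 + B^2)
acrophase <- atan2(-A, B)        # in radians; atan2 keeps the angle in the right quadrant

summary(fit)                     # R-squared and overall F-test p-value
c(MESOR = mesor, Amplitude = amplitude, Acrophase = acrophase)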

Related

How can r-squared be negative when the correlation between prediction and truth is positive?

I'm trying to understand how the R-squared (and also the explained-variance) metric can be negative (indicating non-existent forecasting power) when, at the same time, the correlation coefficient between prediction and truth (as well as the slope in a linear regression of truth on prediction) is positive.
R Squared can be negative in a rare scenario.
R squared = 1 - (SSR/SST)
Here, SST stands for the total sum of squares, which measures how far the data points are from the mean of the target variable; that mean plays the role of a baseline "regression line" here.
SST = Sum(Square(Each data point - Mean of the target variable))
For example,
If we want to build a regression model to predict the height of a student with weight as the independent variable, then a possible prediction without much effort is to calculate the mean height of all current students and use that as the prediction.
In the above diagram, the red line is that "regression line", which is nothing but the mean of all heights. This mean takes no effort to calculate and can be considered one of the worst prediction methods, with poor accuracy. In the diagram itself we can see that the prediction is nowhere near the original data points.
Now consider SSR.
SSR stands for the Sum of Squared Residuals. These residuals are computed from the model we actually build with a proper mathematical approach (linear regression, Bayesian regression, polynomial regression, or any other). If we use a more sophisticated approach than the naive mean, our accuracy will obviously increase.
SSR = Sum(Square(Each data point - Each corresponding predicted point on the regression line))
In the above diagram, let's say the blue line indicates a sophisticated model built with much more mathematical analysis. We can see that it has obviously higher accuracy than the red line.
Now return to the formula:
R Squared = 1 - (SSR/SST)
Here,
SST will be a large number, because it comes from a very poor model (the red line).
SSR will be a small number, because it comes from the best model we developed after much mathematical analysis (the blue line).
So SSR/SST will be a very small number (it gets smaller whenever SSR decreases), and 1 - (SSR/SST) will be close to 1.
So we can infer that the higher R Squared is, the better the model.
This is the generic case, but it does not carry over directly to the many situations where multiple independent variables are present. In the example we had only one independent variable and one target variable, but in real problems we may have hundreds of independent variables for a single dependent variable. The actual problem is that, out of those hundreds of independent variables:
Some variables will have very high correlation with target variable.
Some variables will have very small correlation with target variable.
Also some independent variables will have no correlation at all.
So, R Squared is calculated on the assumption that the mean line of the target (a horizontal line, perpendicular to the y-axis) is the worst fit a model can have. SST is the squared difference between this mean line and the original data points. Similarly, SSR is the squared difference between the predicted data points (from the model plane) and the original data points.
SSR/SST is a ratio telling you how bad your model's errors are relative to that worst case. If your model can build a plane that is even somewhat better than the worst case, then in the vast majority of cases SSR < SST, which makes R Squared positive when you substitute it into the equation.
But what if SSR > SST? That means your regression plane is worse than simply predicting the mean. In this case R Squared will obviously be negative, but it happens only in a small minority of cases.
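As a quick toy illustration in R (the numbers are made up), here is how R Squared = 1 - (SSR/SST) can come out negative even though the predictions are positively correlated with the actuals:
actual    <- c(3.1, 2.9, 3.0, 3.2, 2.8)
predicted <- actual * 10 - 20            # perfectly correlated with actual, but badly scaled/offset

cor(actual, predicted)                   # +1: positive correlation

sst <- sum((actual - mean(actual))^2)    # total sum of squares around the mean line
ssr <- sum((actual - predicted)^2)       # sum of squared residuals of the model

1 - ssr / sst                            # strongly negative, because SSR >> SST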
This answer was originally written by me on Quora:
https://qr.ae/pNsLU8
https://qr.ae/pNsLUr

How can MAPE be calculated if some of the actuals in the dataset are 0 values?

I am new to data science and trying to understand how to evaluate the difference between forecasts and actuals.
Let's say I have actuals:
27.580
25.950
0.000 (Sum = 53.53)
And my predicted values using XGBoost are:
29.9
25.4
15.0 (Sum = 70.3)
Is it better to just evaluate based on the sums? For example, all actuals minus all predicted: difference = 70.3 - 53.53?
Or is it better to evaluate the difference using forecasting error measures like MSE, MAE, RMSE, or MAPE?
Since I read that MAPE is the most widely accepted, how can it be implemented in cases where the denominator is 0, as in my actuals above?
Is there a better way to evaluate deviation from actuals, or are these the only legitimate methods? My objective is to build more predictive models involving different variables, which will give me different predicted values, and then choose the one with the least deviation from the actuals.
Whether you should evaluate each point or the sum depends on your data and your use case.
For example, if each point represents a time bucket, and the accuracy of each time bucket is important (for example, for a production plan), then I would say you need to evaluate each bucket.
If you are to measure the accuracy of the sum, then you might as well also forecast based on the sum.
As for your question on MAPE: there is no way around the issue you mention. Your data need to be non-zero for MAPE to be meaningful. If you are only assessing one time series, you can use the MAE instead, and then you do not have the problem of the accuracy being infinite/undefined.
But there are many ways to measure accuracy, and in my experience which one is preferable depends very much on your use case and your data set. See Hyndman's article on accuracy measures for intermittent demand for some good points.
I use MdAPE (Median Absolute Percentage Error) whenever MAPE cannot be calculated because of zeros.
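To make the zero-denominator problem concrete, here is a small R sketch using the numbers from the question, comparing MAE, MAPE, and MdAPE:
actual    <- c(27.580, 25.950, 0.000)
predicted <- c(29.9, 25.4, 15.0)

mae   <- mean(abs(actual - predicted))                     # works fine with zero actuals
mape  <- mean(abs((actual - predicted) / actual)) * 100    # Inf: the last term divides by 0
mdape <- median(abs((actual - predicted) / actual)) * 100  # finite here, because the
                                                           # zero-actual term is not the median
c(MAE = mae, MAPE = mape, MdAPE = mdape)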

standard error of addition, subtraction, multiplication and ratio

Let's say I have two random variables, x and y, each with n observations. I've used a forecasting method to estimate x_{n+1} and y_{n+1}, and I also have the standard error for each estimate. My question is: what are the formulas for the standard error of x_{n+1} + y_{n+1}, x_{n+1} - y_{n+1}, x_{n+1} * y_{n+1}, and x_{n+1} / y_{n+1}, so that I can calculate prediction intervals for these four combinations? Any thoughts would be much appreciated. Thanks.
Well, the general topic you need to look at is called "change of variables" in mathematical statistics.
The density function for a sum of random variables is the convolution of the individual densities (but only if the variables are independent). Likewise for the difference. In special cases, that convolution is easy to find. For example, for Gaussian variables the density of the sum is also a Gaussian.
For product and quotient, there aren't any simple results, except in special cases. For those, you might as well compute the result directly, maybe by sampling or other numerical methods.
If your variables x and y are not independent, that complicates the situation. But even then, I think sampling is straightforward.
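If you do go the sampling route, a rough R sketch (assuming the two forecasts are independent and roughly Gaussian, with made-up point estimates and standard errors) could look like this:
set.seed(1)
n_sim <- 100000
x_hat <- 10; x_se <- 1.2    # forecast and standard error for x_{n+1} (made up)
y_hat <-  4; y_se <- 0.8    # forecast and standard error for y_{n+1} (made up)

x <- rnorm(n_sim, x_hat, x_se)
y <- rnorm(n_sim, y_hat, y_se)

sd(x + y)                           # ~ sqrt(x_se^2 + y_se^2)
sd(x - y)                           # same formula as for the sum
sd(x * y)                           # no simple closed form in general
sd(x / y)                           # likewise
quantile(x / y, c(0.025, 0.975))    # e.g. a 95% prediction interval for the ratio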

How do i prove that my derived equation and the Monte-Carlo simulation are equivalent?

I have derived and implemented an equation for an expected value. To show that my code is free of errors, I have run a Monte Carlo computation a number of times to show that it converges to the same value as the equation I derived.
Now that I have the data, how can I visualize this?
Is this even the correct test to do?
Can I give a measure of how sure I am that the results are correct?
It's not clear what you mean by visualising the data, but here are some ideas.
If your Monte Carlo simulation is correct, then the Monte Carlo estimator for your quantity is just the mean of the samples. The variance of your estimator (how far the average will typically be from the 'correct' value) scales inversely with the number of samples you take: as long as you take enough samples, you'll get arbitrarily close to the correct answer. So, use a moderate number of samples (1,000 should suffice if the quantity is univariate) and look at the average. If this doesn't agree with your theoretical expectation, then you have an error somewhere, in one of the two calculations.
You can also use a histogram of your samples, again if they're one-dimensional. The distribution of samples in the histogram should match the theoretical distribution you're taking the expectation of.
If you know the variance in the same way as you know the expectation, you can also look at the sample variance (the mean squared difference between the sample and the expectation), and check that this matches as well.
EDIT: to put something more 'formal' in the answer!
If M(x) is your Monte Carlo estimator for E[X], then as n -> inf, |M(x) - E[X]| -> 0. The variance of M(x) is inversely proportional to n, but its exact form depends on what M estimates. You could construct a specific test based on the mean and variance of your samples to check that what you've done makes sense. For example, every 100 iterations compute the mean of your samples and take the absolute difference between it and your theoretical E[X]. If this difference decreases, you're probably error-free; if not, you have an issue either in your theoretical derivation or in your Monte Carlo estimator.
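Here is a rough R sketch of that running-mean check, using a standard normal draw as a stand-in for whatever you are actually simulating (so the theoretical E[X] is 0 here):
set.seed(42)
theoretical <- 0                      # E[X] for the stand-in distribution
samples <- rnorm(10000)

running_mean <- cumsum(samples) / seq_along(samples)
error <- abs(running_mean - theoretical)

error[seq(100, length(error), by = 100)]   # should trend towards zero
plot(error, type = "l", log = "y",
     xlab = "number of samples", ylab = "|running mean - E[X]|")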
Why not just do a simple t-test? From your theoretical equation you have the true mean mu_0, and your simulator has mean mu_1. Note that we can't calculate mu_1 exactly; we can only estimate it using the sample mean/average. So our hypotheses are:
H_0: mu_0 = mu_1 and H_1: mu_0 does not equal mu_1
The test statistic is the usual one-sample test statistic, i.e.
T = (mu_0 - x)/(s/sqrt(n))
where
mu_0 is the value from your equation
x is the average from your simulator
s is the standard deviation
n is the number of values used to calculate the mean.
In your case, n is going to be large, so this is equivalent to a normal (z) test. We reject H_0 when T falls outside (-3, 3). This is roughly equivalent to a p-value < 0.01.
A couple of comments:
You can't "prove" that the means are equal.
You mentioned that you want to test a number of values. One possible solution is a Bonferroni-type correction: you tighten your significance cutoff to alpha/N, where N is the number of tests you are running.
Make your sample size as large as possible. Since we don't have any idea about the variability in your Monte Carlo simulation it's impossible to say use n=....
The p-value < 0.01 figure for T outside (-3, 3) just comes from the normal distribution.
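In R this check is essentially one line; here is a sketch with a made-up theoretical value mu_0 and stand-in Monte Carlo output:
set.seed(7)
mu_0 <- 2.5                                     # value from the derived equation (made up)
mc_samples <- rnorm(5000, mean = 2.5, sd = 1)   # stand-in for the simulator's output

t.test(mc_samples, mu = mu_0)                   # H_0: the simulator's mean equals mu_0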

Looking for an estimation method (data analysis)

Since I have no idea about what I am doing right now, my wording may sound funny. But seriously, I need to learn.
The problem I'm facing is to come up with a method (model) to estimate how a software program behaves: namely its running time and maximal memory usage. What I already have is a large amount of data. This data set gives an overview of how the program behaves under different conditions, e.g.
<code>
RUN    Criterion_A  Criterion_B  Criterion_C  Criterion_D  Criterion_E
-----------------------------------------------------------------------
R0001           12            2         3556           27            9
R0002            2            5         2154           22            8
R0003           19           12         5556           37            9
R0004           10            3         1556            7            9
R0005            5            1          556           17            8
</code>
I have thousands of rows of such data. Now I need to know how I can estimate (forecast) the running time and maximal memory usage if I know all the criteria in advance. What I need is an approximation that gives hints (upper limits, or ranges).
I have a feeling that this is a typical ??? problem whose name I don't know. Could you show me some hints or give me some ideas (theories, explanations, webpages) or anything that may help? Thanks!
You want a new program that takes as input one or more criteria, then outputs an estimate of the running time or memory usage. This is a machine learning problem.
Your inputs can be listed as a vector of numbers, like this:
input = [ A, B, C, D, E ]
One of the simplest algorithms for this would be a K-nearest neighbor algorithm. The idea behind this is that you'll take your input vector of numbers, and find in your database the vector of numbers that is most similar to your input vector. For example, given this vector of inputs:
input = [ 11, 1.8, 3557, 29, 10 ]
You can assume that the running time and memory should be very similar to the values from this run (listed in your table above):
R0001 12 2 3556 27 9
There are several algorithms for calculating the similarity between these two vectors, one simple and intuitive such algorithm is the Euclidean distance. As an example, the Euclidean distance between the input vector and the vector from the table is this:
dist = sqrt( (11-12)^2 + (1.8-2)^2 + (3557-3556)^2 + (27-29)^2 + (9-10)^2 )
dist = 2.6533
It should be intuitively clear that points with lower distance should be better estimates for running time and memory usage, as the distance should describe the similarity between two sets of criteria. Assuming your criteria are informative and well-selected, points with similar criteria should have similar running time and memory usage.
Here's some example code of how to do this in R:
r1 <- c(11, 1.8, 3557, 29, 10)   # input vector of criteria
r2 <- c(12, 2.0, 3556, 27,  9)   # row R0001 from the table above
print(r1)
print(r2)

# Euclidean distance written out term by term
dist_r1_r2 <- sqrt((11 - 12)^2 + (1.8 - 2)^2 + (3557 - 3556)^2 + (27 - 29)^2 + (9 - 10)^2)
print(dist_r1_r2)

# The same distance computed with vectorised arithmetic
smarter_dist_r1_r2 <- sqrt(sum((r1 - r2)^2))
print(smarter_dist_r1_r2)
Taking the running time and memory usage of your nearest row is the KNN algorithm for K=1. This approach can be extended to include data from multiple rows by taking a weighted combination of multiple rows from the database, with rows with lower distances to your input vector contributing more to the estimates. Read the Wikipedia page on KNN for more information, especially with regard to data normalization, including contributions from multiple points, and computing distances.
When calculating the difference between these lists of input vectors, you should consider normalizing your data. The rationale for doing this is that a difference of 1 unit between 3557 and 3556 for criteria C may not be equivalent to a difference of 1 between 11 and 12 for criteria A. If your data are normally distributed, you can convert them all to standard scores (or Z-scores) using this formula:
N_trans = (N - mean(N)) / sdev(N)
There is no general solution on the "right" way to normalize data as it depends on the type and range of data that you have, but Z-scores are easy to compute and a good method to try first.
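Here is a rough R sketch of both steps, z-scoring the criteria and doing a K = 1 lookup; the data frame runs and the outcome columns run_time and max_memory are assumed names for however your data are actually stored:
criteria <- c("Criterion_A", "Criterion_B", "Criterion_C", "Criterion_D", "Criterion_E")

X <- scale(runs[, criteria])              # convert each criterion column to z-scores
centers <- attr(X, "scaled:center")
scales  <- attr(X, "scaled:scale")

new_run <- c(11, 1.8, 3557, 29, 10)
new_z   <- (new_run - centers) / scales   # apply the same transform to the query

d <- sqrt(rowSums(sweep(X, 2, new_z)^2))  # Euclidean distance to every stored run
nearest <- which.min(d)
runs[nearest, c("run_time", "max_memory")]   # the K = 1 estimate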
There are many more sophisticated techniques for constructing estimates such as this, including linear regression, support vector regression, and non-linear modeling. The idea behind some of the more sophisticated methods is that you try and develop an equation that describes the relationship of your variables to running time or memory. For example, a simple application might just have one criterion and you can try and distinguish between models such as:
running_time = s1 * A + s0
running_time = s2 * A^2 + s1 * A + s0
running_time = s3 * log(A) + s2 * A^2 + s1 * A + s0
The idea is that A is your fixed criterion, and the sN are free parameters that you can tweak until you get a model that works well.
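For instance, with lm in R you could fit and compare the first two candidate models; the data frame runs and the columns Criterion_A and run_time are assumed names, not from the question:
fit_linear    <- lm(run_time ~ Criterion_A, data = runs)
fit_quadratic <- lm(run_time ~ poly(Criterion_A, 2, raw = TRUE), data = runs)

summary(fit_linear)$r.squared                # how much variance each model explains
summary(fit_quadratic)$r.squared

predict(fit_quadratic, newdata = data.frame(Criterion_A = 15))   # forecast for a new criterion value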
One problem with this approach is that there are many different possible models that have different numbers of parameters. Distinguishing between models that have different numbers of parameters is a difficult problem in statistics, and I don't recommend tackling it during your first foray into machine learning.
Some questions that you should ask yourself are:
Do all of my criteria affect both running time and memory usage? Do some affect only one or the other, and are some useless from a predictive point of view? Answering this question is called feature selection, and is an outstanding problem in machine learning.
Do you have any a priori estimates of how your variables should influence running time or memory usage? For example, you might know that your application uses a sorting algorithm that is N * log(N) in time, which means that you explicitly know the relationship between one criterion and your running time.
Do your rows of measured input criteria paired with running time and memory usage cover all of the plausible use cases for your application? If so, then your estimates will be much better, as machine learning can have a difficult time with data that it's unfamiliar with.
Do the running time and memory of your program depend on criteria that you don't input into your estimation strategy? For example, if you're depending on an external resource such as a web spider, problems with your network may influence running time and memory usage in ways that are difficult to predict. If this is the case, your estimates will have a lot more variance.
If the criterion you would be forecasting for lies within the range of currently known criteria, then you should do some more research on the interpolation process:
In the mathematical subfield of numerical analysis, interpolation is a method of constructing new data points within the range of a discrete set of known data points
If it lies outside your currently known data range, research extrapolation, which is less accurate:
In mathematics, extrapolation is the process of constructing new data points outside a discrete set of known data points.
Methods
Interpolation methods for your browsing.
A powerpoint presentation detailing some methods used for Extrapolation.
