Calculating coefficient of determination given alpha and mean - statistics

While trying a data science quiz on analyticsvidya.com, I came across the following question:
Consider the following sample: 1, 2, 3, 4, 5, 6. Alpha = 0.05 and the population mean is equal to 4.5. Calculate the coefficient of determination for this data. The answer given is 0.25.
I thought the coefficient of determination was a property of a regression, that is, of the relation between two variables. Is there another interpretation of the coefficient of determination that applies to a single series?

Related

Weighted Least Squares vs Monte Carlo comparison

I have an experimental dataset of the following values (y, x1, x2, w), where y is the measured quantity, x1 and x2 are the two independent variables, and w is the error of each measurement.
The function I've chosen to describe my data is f(x1, x2) [the expression is not reproduced here].
These are my tasks:
1) Estimate values of bi
2) Estimate their standard errors
3) Calculate predicted values of f(x1, x2) on a mesh grid and estimate their confidence intervals
4) Calculate predicted values of a derived expression and of a definite integral [equations not reproduced here], and their confidence intervals on a mesh grid
I have several questions:
1) Can all of my tasks be solved by weighted least squares? I've solved tasks 1-3 using WLS in matrix form by linearising the chosen function, but I have no idea how to solve step 4.
2) I've performed Monte Carlo simulations to estimate the bi and their standard errors. I generated perturbed values y'i from a normal distribution with mean yi and standard deviation wi, and repeated this operation N = 5000 times. For each perturbed dataset I estimated b'i, and from the 5000 values of b'i I calculated their means and standard deviations. In the end, the bi estimated from the Monte Carlo simulation coincide with those found by WLS. Am I correct that the standard deviations of the b'i must be divided by the number of degrees of freedom to obtain the standard errors?
3) How do I estimate confidence bands for predicted values of y using the Monte Carlo approach? I generated a set of perturbed bi values from a normal distribution, using their BLUEs as means together with their standard deviations. Then I calculated many predicted values of f(x1, x2) and found their means and standard deviations. The values of f(x1, x2) found by WLS and MC coincide, but the standard deviations from MC are a factor of 5-45 higher than those from WLS. What scaling factor am I missing here?
4) It seems that some of the parameters b are not independent of each other, since there are only 2 independent variables. Should I take this into account in question 3 when I generate the bi values? If so, how can this be done? Should I use a chi-squared test to decide whether generated values of bi are suitable for further calculations or should be rejected?
In fact, I not only want to solve the tasks mentioned above, but also to compare the two methods of regression analysis. I would appreciate any help and suggestions!
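To make the comparison concrete, here is a minimal Python sketch. Since the chosen model function is not reproduced above, it assumes a plain linear stand-in f(x1, x2) = b0 + b1*x1 + b2*x2 with invented data; all names and values are illustrative only. It runs WLS in matrix form next to the Monte Carlo perturbation described in question 2, and notes that for question 4 the bi should be simulated jointly from their covariance rather than independently:

    import numpy as np

    # Invented stand-in for the unspecified model: f(x1, x2) = b0 + b1*x1 + b2*x2,
    # which is already linear, so no linearisation step is needed here.
    rng = np.random.default_rng(0)
    n = 50
    x1, x2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
    w = rng.uniform(0.05, 0.15, n)             # per-point measurement errors (s.d.)
    X = np.column_stack([np.ones(n), x1, x2])  # design matrix
    y = X @ np.array([1.0, -2.0, 3.0]) + rng.normal(0, w)

    # WLS in matrix form: b = (X' W X)^{-1} X' W y with W = diag(1/w_i^2)
    W = np.diag(1.0 / w**2)
    cov_b = np.linalg.inv(X.T @ W @ X)         # covariance matrix of the estimates
    b_wls = cov_b @ X.T @ W @ y
    se_wls = np.sqrt(np.diag(cov_b))

    # Monte Carlo: perturb y with its known errors, refit, inspect the spread.
    B = np.array([cov_b @ X.T @ W @ (y + rng.normal(0, w)) for _ in range(5000)])
    # The s.d. across replicates directly estimates the standard error --
    # no further division by the number of degrees of freedom is needed.
    se_mc = B.std(axis=0)

    # Question 4: the bi are correlated, so simulate them jointly from the
    # multivariate normal with covariance cov_b, not one coefficient at a time.
    b_sim = rng.multivariate_normal(b_wls, cov_b, size=5000)

    print(b_wls, se_wls)
    print(B.mean(axis=0), se_mc)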

Normalisation or Standardisation for detecting outliers?

When should I use min-max scaling (normalisation) and when standardisation (z-scores) for data pre-processing?
I know that normalisation brings the range of a feature down to 0 to 1, and that z-scores mostly fall between -3 and 3, but I am unsure which of the two techniques to use for detecting outliers in the data.
Let us briefly agree on the terms:
The z-score tells us how many standard deviations a given element of a sample is away from the mean.
Min-max scaling is the method of rescaling a range of measurements to the interval [0, 1].
By those definitions, the z-score usually spans an interval much larger than [-3, 3] if your data follow a long-tailed distribution. On the other hand, plain normalisation does indeed limit the range of the possible outcomes, but it will not help you find outliers, since it merely bounds the data.
What you need for outlier detection are thresholds above or below which you consider a data point to be an outlier. Many programming languages offer violin plots or box plots, which nicely show your data distribution. The methods behind these plots implement a common choice of thresholds:
A box-and-whisker plot shows the quartiles, and the band inside the box is always the second quartile (the median). The ends of the whiskers, however, can represent several possible alternative values, among them:
the minimum and maximum of all of the data [...]
one standard deviation above and below the mean of the data
the 9th percentile and the 91st percentile
the 2nd percentile and the 98th percentile.
All data points outside the whiskers of the box plots are plotted as points and considered outliers.
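To make those thresholds concrete, here is a minimal Python sketch with invented data. It flags outliers with the common 1.5 × IQR whisker rule used by many box-plot implementations, and with a z-score cutoff; the cutoff of 3 is a conventional choice, not a universal one:

    import numpy as np

    rng = np.random.default_rng(1)
    data = np.append(rng.normal(10, 2, 200), [25.0, -4.0])  # two planted outliers

    # Whisker rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

    # z-score rule: points more than 3 standard deviations from the mean.
    z = (data - data.mean()) / data.std()
    z_outliers = data[np.abs(z) > 3]

    print(iqr_outliers)
    print(z_outliers)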

How can I use Correlation Coefficient to calculate change in variables

I calculated the correlation of two variables (size of plot/house vs. cost); the correlation is 0.87. I want to use this index to measure the increase or decrease in cost if the size is increased or decreased. Is that possible using correlation? How?
Correlation only tells us how strongly two variables are linearly related, based on the data we have; it does not provide a method to calculate the value of one variable given the value of another.
If the variables are linearly related, we can predict the actual values that a variable Y will assume when a variable X has some value, using linear regression.
The idea is to fit the data to a linear function and use it to predict values:
Y = bX + a
Usually we first check whether two variables are related using a correlation coefficient (e.g. Pearson's), and then use a regression method (e.g. linear regression) to predict values of the variable of interest given the other.
Here is an easy-to-follow tutorial on linear regression in Python with some theory:
https://realpython.com/linear-regression-in-python/#what-is-regression
Here is a tutorial on the typical problem of house price prediction:
https://blog.akquinet.de/2017/09/19/predicting-house-prices-on-kaggle-part-i/
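As a minimal sketch of that workflow, with invented numbers standing in for real size/cost data: check the correlation first, then fit Y = bX + a and read the expected change in cost per unit of size off the slope b:

    import numpy as np

    # Invented sizes (m^2) and costs -- for illustration only
    size = np.array([50.0, 80.0, 100.0, 120.0, 150.0, 200.0])
    cost = np.array([110.0, 160.0, 195.0, 230.0, 300.0, 390.0])

    r = np.corrcoef(size, cost)[0, 1]  # strength of the linear relation
    b, a = np.polyfit(size, cost, 1)   # least-squares fit of Y = b*X + a

    print(f"correlation r = {r:.2f}")
    print(f"cost ~ {b:.2f} * size + {a:.2f}")
    print(f"each extra unit of size changes cost by about {b:.2f}")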

How do you calculate the standard deviation for data which is mainly discrete but has a probability of being continuous?

I'm having some trouble calculating the standard deviation of a game. In the game you can get several different discrete scores, each with a given fixed probability. There is also a 5% chance that your score is randomly generated; you do not know the distribution of that random variable, only its mean and variance.
I've calculated the variance of the main game (ignoring the random variable) to be 5.2. The variance of the random variable is 137. From this I get a standard deviation of
sqrt(5.2 + 0.05 * 137) = 3.47
Is this the correct method?
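For comparison, the exact variance of a 95%/5% mixture follows from the law of total variance, Var(X) = E[Var(X|Z)] + Var(E[X|Z]): the component variances enter with their weights, and a between-component term appears whenever the two means differ. A minimal Python check, with hypothetical means since the post does not give them:

    p = 0.05                          # chance the score is randomly generated
    var_main, var_rand = 5.2, 137.0   # component variances from the post
    mu_main, mu_rand = 10.0, 10.0     # hypothetical means -- NOT given in the post

    mu = (1 - p) * mu_main + p * mu_rand
    within = (1 - p) * var_main + p * var_rand                     # E[Var(X|Z)]
    between = (1 - p) * (mu_main - mu)**2 + p * (mu_rand - mu)**2  # Var(E[X|Z])
    # When the means coincide, between = 0 and the s.d. is
    # sqrt(0.95*5.2 + 0.05*137), about 3.43.
    print((within + between) ** 0.5)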

How to know when to use a particular kind of Similarity index? Euclidean Distance vs. Pearson Correlation

What are some of the deciding factors to take into consideration when choosing a similarity index?
In what cases is a Euclidean Distance preferred over Pearson and vice versa?
Correlation is unit-independent; if you scale one of the objects by a factor of ten, you will get different Euclidean distances but the same correlation distance. Correlation metrics are therefore excellent when you want to measure the distance between objects such as genes defined by their expression profiles.
Often, the absolute or squared correlation is used as the distance metric, because we are more interested in the strength of the relationship than in its sign.
However, correlation is only suitable for high-dimensional data; there is hardly any point in calculating it for two- or three-dimensional data points.
Also note that the "Pearson distance" is a weighted type of Euclidean distance, not the "correlation distance" based on the Pearson correlation coefficient.
It really depends on the application you have at hand. Very briefly: if you are dealing with data where the actual differences in attribute values matter, go with Euclidean distance; if you are looking for trend or shape similarity, go with correlation. Note that if you z-score normalise each object, Euclidean distance behaves similarly to the Pearson correlation coefficient, and that Pearson is insensitive to linear transformations of the data. There are other correlation coefficients that take only the ranks of the values into account, making them insensitive to any monotonic transformation, linear or not. Finally, the usual way to use correlation as a dissimilarity is 1 - correlation, which does not satisfy all the rules of a metric distance.
There are studies on which proximity measure to select for a particular application, for instance:
Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Ivan G. Costa Filho, "Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 99, no. PrePrints, p. 1, 2013.
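A small Python demonstration of the points above, with invented vectors: scaling an object changes its Euclidean distances but not its correlation distance, and after z-score normalisation of each object the two measures are directly linked by d^2 = 2n(1 - r):

    import numpy as np

    def corr_dist(u, v):
        """Correlation distance: 1 - Pearson r (not a true metric)."""
        return 1 - np.corrcoef(u, v)[0, 1]

    def zscore(u):
        return (u - u.mean()) / u.std()

    a = np.array([2.0, 4.0, 1.0, 7.0, 5.0])
    b = np.array([3.0, 5.0, 2.0, 9.0, 6.0])

    # Scaling one object by 10 changes the Euclidean distance ...
    print(np.linalg.norm(a - b), np.linalg.norm(10 * a - b))
    # ... but leaves the correlation distance untouched.
    print(corr_dist(a, b), corr_dist(10 * a, b))

    # After z-scoring each object, squared Euclidean distance and
    # correlation distance satisfy d^2 = 2 * n * (1 - r).
    za, zb = zscore(a), zscore(b)
    print(np.linalg.norm(za - zb) ** 2, 2 * len(a) * corr_dist(a, b))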
