How can I use Correlation Coefficient to calculate change in variables - statistics

I calculated a correlation of two dependent variables (size of plot/house vs cost), the correlation stands at 0.87. I want to use this index to measure the increase or decrease in cost if size is increased or decreased. Is it possible using correlation? How?

Correlation only tells us how much two variables are linearly related based on the data we have, but in it does not provide a method to calculate the value of variable given the value of another.
If the variables are linearly related we can predict the actual values that a variable Y will assume when a variable X has some value using Linear Regression:
The idea is to try and fit the data to a linear function, and use it to predict the values:
Y = bX + a
Usually we first discover if two variables are related using a Correlation Coefficient(ex. Pearson Coefficient), then we use a Regression method(ex. Linear) to predict values of a variable of interest given another.
Here is an easy to follow tutorial on Linear Regression in Python with some theory:
https://realpython.com/linear-regression-in-python/#what-is-regression
Here a tutorial on the typical problem of house price prediction:
https://blog.akquinet.de/2017/09/19/predicting-house-prices-on-kaggle-part-i/

Related

Weighted Least Squares vs Monte Carlo comparison

I have an experimental dataset of the following values (y, x1, x2, w), where y is the measured quantity, x1 and x2 are the two independet variables and w is the error of each measurement.
The function I've chosen to describe my data is
These are my tasks:
1) Estimate values of bi
2) Estimate their standard errors
3) Calculate predicted values of f(x1, x2) on a mesh grid and estimate their confidence intervals
4) Calculate predicted values of
and definite integral
and their confidence intervals on a mesh grid
I have several questions:
1) Can all of my tasks be solved by weighted least squares? I've solved task 1-3 using WLS in matrix form by linearisation of the chosen function, but I have no idea, how to solve step №4.
2) I've performed Monte Carlo simulations to estimate bi and their s.e. I've generated perturbated values y'i from normal distribution with mean yi and standard deviation wi. I did this operation N=5000 times. For each perturbated dataset I estimated b'i, and from 5000 values of b'i I calculated mean values and their standard distribution. In the end, bi estimated from Monte-Carlo simulation coincide with those found by WLS. Am I correct, that standard deviations of b'i must be devided by № of Degrees of freedom to obtain standard error?
3) How to estimate confidence bands for predicted values of y using Monte-Carlo approach? I've generated a bunch of perturbated bi values from normal distribution using their BLUE as mean and standard deviations. Then I calculated lots of predicted values of f(x1,x2), found their means and standard deviations. Values of f(x1,x2) found by WLS and MC coincide, but s.d. found from MC are 5-45 order higher than those from WLS. What is the scaling factor that I'm missing here?
4) It seems that some of parameters b are not independent of each other, since there are only 2 independent variables. Should I take this into account in question 3, when I generate bi values? If yes, how can this be done? Should I use Chi-squared test to decide whether generated values of bi are suitable for further calculations, or should they be rejected?
In fact, I not only want to solve tasks I've mentioned earlier, but also I want to compare the two methods for regression analysys. I would appreciate any help and suggestions!

Standardize or subtract constant to data for regression

I am attempting to create a prediction model using multiple linear regression.
One of the predictor variables I want to use is a percentage, so it ranges from 0 - 100. I hypothesize that when it’s <50% there will be a negative effect on the target variable and when >50% a positive effect.
The mean of the predictor variable isn’t exactly 50 in my data set so I am unsure if I centre or Standardize this variable, or just subtract 50 from it to create the split I am looking for.
I am very new to statistics and self teaching myself at the moment, any help is greatly appreciated.

Can I calculate r correlation for catigorical and numerical value?

I have been reading a paper and I have found a table that felt very strange for me. The researchers have calculated r value and related it to a categorical factor. From my knowledge r is a function of correlation between two numerical variables.
Any ideas
enter image description here
When you want to assess the association between a continuous variable and a dichotomous variable (e.g. gender), you would typically use the point-biserial correlation
However, pearson and point-biserial are mathematically equivalent and will give you the same correlation value. So technically their values are correct, just that their naming might be a bit misleading/confusing

Excel Interpolate with logarithmic prediction

Is there a function within Excel to Interpolate while taking into account a logarithmic prediction?
At the moment I am using linear interpolation but would like to find a better way to fill in the blanks if possible.
There's no logarithmic regression or interpolation in Excel, even in the Anlaysis ToolPak. You'll need much more advanced software for that, such as MatLab.
If you're stuck working in Excel... here's a possible mathematical solution:
Rather than working with the raw data x and y, instead try plotting x and a^y, where a is the base. (Or plotting log(x,a) against y.) If you have the correct base a (and there's no vertical offset), you will then have a linear relationship from which you can perform a linear interpolation as normal, then convert the interpolated values back to actual values by taking the log of them.
If you don't know what a is, then you can instead calculate a line of best fit for an arbitrary a, calculate the standard residuals, and then use Problem Solver to modify a until you get the lowest possible standard residuals, at which point you have the best estimate of a.
Similarly if there is a vertical offset b, you'll need to test some variables there that also result in a linear relationship. Plot x against a^(y-b)

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern is: The distribution amongst the x-values is not uniform. There are many more values closer to t=0, but at t=5 (for example), the frequency of data points is much less.
Another concern. What happens if two values fall within 1 bin? I assume I would need the average of these values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series' are not uniformly distributed, interpolating prior to computing the mean is one way to avoid biasing towards times where you have more frequent samples. Note that by definition, interpolation will likely reduce the range of your values, i.e. the interpolated points aren't likely to fall exactly at the times of your measured points. This has a greater effect on the extreme statistics (e.g. 5th and 95th percentiles) rather than the mean. If you plan on going this route, you'll need the interp1 and mean functions
An alternative is to do a weighted mean. This way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time then you can compute the weighted mean by:
timeStep = diff(t);
weightedMean = timeStep .* x(1:end-1) / sum(timeStep);
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".

Resources