Weighted Least Squares vs Monte Carlo comparison

I have an experimental dataset of values (y, x1, x2, w), where y is the measured quantity, x1 and x2 are the two independent variables, and w is the error of each measurement.
The function I've chosen to describe my data is
These are my tasks:
1) Estimate values of bi
2) Estimate their standard errors
3) Calculate predicted values of f(x1, x2) on a mesh grid and estimate their confidence intervals
4) Calculate predicted values of
and definite integral
and their confidence intervals on a mesh grid
I have several questions:
1) Can all of my tasks be solved by weighted least squares? I've solved tasks 1-3 using WLS in matrix form by linearising the chosen function, but I have no idea how to solve task 4.
2) I've performed Monte Carlo simulations to estimate the bi and their standard errors. I generated perturbed values y'i from a normal distribution with mean yi and standard deviation wi, and repeated this N=5000 times. For each perturbed dataset I estimated b'i, and from the 5000 values of b'i I calculated mean values and their standard deviations. In the end, the bi estimated from the Monte Carlo simulation coincide with those found by WLS. Am I correct that the standard deviations of b'i must be divided by the number of degrees of freedom to obtain the standard errors?
3) How can I estimate confidence bands for predicted values of y using the Monte Carlo approach? I generated a bunch of perturbed bi values from a normal distribution, using their BLUEs as means and their standard deviations as spreads. Then I calculated many predicted values of f(x1, x2) and found their means and standard deviations. The values of f(x1, x2) found by WLS and MC coincide, but the standard deviations from MC are 5-45 times higher than those from WLS. What scaling factor am I missing here?
4) It seems that some of the parameters b are not independent of each other, since there are only 2 independent variables. Should I take this into account in question 3 when I generate the bi values? If yes, how can this be done? Should I use a chi-squared test to decide whether generated values of bi are suitable for further calculations or should be rejected?
In fact, I not only want to solve the tasks mentioned above, but also want to compare the two methods of regression analysis. I would appreciate any help and suggestions!
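To make the comparison concrete, here is a minimal Python sketch of the Monte Carlo procedure from question 2, assuming a generic linearised model y ≈ X·b; the data, the design matrix, and the coefficients below are placeholders, not the ones from the actual problem.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: substitute the real (y, x1, x2, w) measurements here.
n = 50
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.uniform(0.0, 1.0, n)
w = np.full(n, 0.1)                          # measurement errors (standard deviations)
X = np.column_stack([np.ones(n), x1, x2])    # hypothetical linearised design matrix
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0.0, w)

# WLS in matrix form: b = (X^T W X)^-1 X^T W y, with W = diag(1/w^2)
W = np.diag(1.0 / w**2)
cov_b = np.linalg.inv(X.T @ W @ X)           # parameter covariance matrix
b_wls = cov_b @ X.T @ W @ y
se_wls = np.sqrt(np.diag(cov_b))

# Monte Carlo: perturb each y_i with N(y_i, w_i), refit, collect estimates.
N = 5000
b_mc = np.empty((N, X.shape[1]))
for k in range(N):
    y_pert = y + rng.normal(0.0, w)
    b_mc[k] = cov_b @ X.T @ W @ y_pert

print("WLS:", b_wls, se_wls)
print("MC: ", b_mc.mean(axis=0), b_mc.std(axis=0, ddof=1))

# For prediction bands (questions 3 and 4): drawing parameter sets from the
# multivariate normal with the full covariance preserves the correlations
# between the b_i that independent per-parameter draws ignore.
b_draws = rng.multivariate_normal(b_wls, cov_b, size=N)

Note that in this setup the standard deviation of the b'i across replicates already estimates the standard error directly, with no further division by the number of degrees of freedom.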

Related

Normalisation or Standardisation for detecting outlier?

When should I use min-max scaling (normalisation) and when standardisation (z-scores) for data pre-processing?
I know that normalisation brings the range of a feature down to [0, 1], and the z-score brings it down to roughly [-3, 3], but I am unsure which of the two techniques to use for detecting outliers in data.
Let us briefly agree on the terms:
The z-score tells us how many standard deviations a given element of a sample is away from the mean.
Min-max scaling is the method of rescaling a range of measurements to the interval [0, 1].
By those definitions, the z-score usually spans an interval much larger than [-3, 3] if your data follows a long-tailed distribution. On the other hand, plain normalisation does indeed limit the range of the possible outcomes, but it will not help you find outliers, since it just bounds the data.
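A small illustration of the two definitions, as a Python sketch on made-up data:

import numpy as np

x = np.array([2.0, 3.0, 3.5, 4.0, 4.2, 50.0])   # made-up sample with one extreme value

z = (x - x.mean()) / x.std(ddof=1)               # z-score: distance from the mean in standard deviations
minmax = (x - x.min()) / (x.max() - x.min())     # min-max scaling onto [0, 1]

print(z)       # the extreme value gets a large |z|
print(minmax)  # everything, outlier included, is squeezed into [0, 1]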
What you need for outlier detection are thresholds above or below which you consider a data point to be an outlier. Many programming languages offer violin plots or box plots which nicely show your data distribution. The methods behind these plots implement a common choice of thresholds:
Box and whisker [of the box plot] plots quartiles, and the band inside the box is always the second quartile (the median). But the ends of the whiskers can represent several possible alternative values, among them:
the minimum and maximum of all of the data [...]
one standard deviation above and below the mean of the data
the 9th percentile and the 91st percentile
the 2nd percentile and the 98th percentile.
All data points outside the whiskers of the box plots are plotted as points and considered outliers.
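As a minimal sketch of such threshold-based detection, here is the common 1.5 x IQR whisker rule in Python (one choice among the whisker definitions quoted above):

import numpy as np

x = np.array([2.0, 3.0, 3.5, 4.0, 4.2, 50.0])   # same made-up sample

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # whisker ends (Tukey's rule)

outliers = x[(x < lo) | (x > hi)]
print(outliers)                                  # -> [50.]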

How can I use Correlation Coefficient to calculate change in variables

I calculated the correlation of two dependent variables (size of plot/house vs cost); the correlation stands at 0.87. I want to use this index to measure the increase or decrease in cost if size is increased or decreased. Is it possible using correlation? How?
Correlation only tells us how much two variables are linearly related based on the data we have; it does not provide a method to calculate the value of one variable given the value of another.
If the variables are linearly related we can predict the actual values that a variable Y will assume when a variable X has some value using Linear Regression:
The idea is to try and fit the data to a linear function, and use it to predict the values:
Y = bX + a
Usually we first discover whether two variables are related using a correlation coefficient (e.g. the Pearson coefficient), then we use a regression method (e.g. linear regression) to predict values of the variable of interest given the other.
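As a minimal sketch of that workflow in Python (the size/cost numbers are made up):

import numpy as np
from scipy import stats

size = np.array([50.0, 70.0, 90.0, 120.0, 150.0])     # made-up plot sizes
cost = np.array([110.0, 150.0, 200.0, 260.0, 330.0])  # made-up costs

res = stats.linregress(size, cost)     # fits cost = b*size + a
print(res.rvalue)                      # the Pearson correlation coefficient
print(res.slope, res.intercept)        # b and a

# Predict the cost for a new size:
print(res.slope * 100.0 + res.intercept)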
Here is an easy-to-follow tutorial on linear regression in Python, with some theory:
https://realpython.com/linear-regression-in-python/#what-is-regression
Here is a tutorial on the typical problem of house price prediction:
https://blog.akquinet.de/2017/09/19/predicting-house-prices-on-kaggle-part-i/

Why do principal components of the covariance matrix capture the maximum variance of the variables?

I am trying to understand PCA and have gone through several tutorials. So far I understand that the eigenvectors of a matrix give the directions in which vectors are only scaled, not rotated, when multiplied by that matrix, in proportion to the eigenvalues. Hence the eigenvector associated with the maximum eigenvalue defines the direction of maximum scaling. I understand that along the principal component the variation is maximum and the reconstruction error is minimum. What I do not understand is:
why does finding the eigenvectors of the covariance matrix correspond to the axes along which the original variables are best defined?
In addition to tutorials, I reviewed other answers here, including this and this. But I still do not understand it.
Your premise is incorrect. PCA (and eigenvectors of a covariance matrix) certainly don't represent the original data "better".
Briefly, the goal of PCA is to find some lower dimensional representation of your data (X, which is in n dimensions) such that as much of the variation is retained as possible. The upshot is that this lower dimensional representation is an orthogonal subspace and it's the best k dimensional representation (where k < n) of your data. We must find that subspace.
Another way to think about this: given a data matrix X find a matrix Y such that Y is a k-dimensional projection of X. To find the best projection, we can just minimize the difference between X and Y, which in matrix-speak means minimizing ||X - Y||^2.
Since Y is just a projection of X onto a lower dimension, we can write Y = X*v*v^T, where v*v^T is a low-rank projection. Google "rank" if this doesn't make sense. We know X*v has a lower dimension than X, but we don't know what direction it points in.
To find that direction, we look for the v such that ||X - X*v*v^T||^2 is minimized. This is equivalent to maximizing ||X*v||^2 = v^T*X^T*X*v, and (for centered data) X^T*X is proportional to the sample covariance matrix of your data. This is mathematically why we care about the covariance of the data. It also turns out that the v that does this best is an eigenvector. There is one eigenvector for each dimension of the lower dimensional projection/approximation, and these eigenvectors are orthogonal.
Remember, if they are orthogonal, then the covariance between any two of them is 0. Now think of a matrix with non-zero diagonal entries and zeros off the diagonal: this is the covariance matrix of orthogonal columns, i.e. each column is an eigenvector.
Hopefully that helps bridge the connection between the covariance matrix and how it yields the best lower dimensional subspace.
Again, eigenvectors don't better define our original variables. The axes determined by applying PCA to a dataset are linear combinations of our original variables that tend to exhibit maximum variance and produce the closest possible approximation to our original data (as measured by the L2 norm).
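A minimal numpy sketch of this construction on made-up data, using the covariance eigendecomposition directly rather than a library PCA:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0, 0], [1.0, 1.0, 0], [0, 0, 0.2]])
Xc = X - X.mean(axis=0)                  # center the data first

C = Xc.T @ Xc / (len(Xc) - 1)            # sample covariance matrix
evals, evecs = np.linalg.eigh(C)         # eigh returns ascending eigenvalues
order = np.argsort(evals)[::-1]          # sort descending by variance
evals, evecs = evals[order], evecs[:, order]

k = 2
V = evecs[:, :k]                         # top-k eigenvectors
Y = Xc @ V @ V.T                         # best rank-k approximation of Xc
print(evals)                             # variance captured along each axis
print(np.sum((Xc - Y) ** 2))             # reconstruction error ||X - X*v*v^T||^2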

Convert GMM-UBM scores to equivalent accuracy percent

I have constructed a GMM-UBM model for speaker recognition. The models adapted for each speaker output scores calculated as log-likelihood ratios. Now I want to convert these likelihood scores to an equivalent number between 0 and 100. Can anybody guide me please?
There is no straightforward formula. You can do simple things like
prob = exp(logratio_score)
but those might not reflect the true distribution of your data. The computed probability percentage of your samples will not be uniformly distributed.
Ideally you need to take a large dataset and collect statistics on what acceptance/rejection rate you have for each score. Once you build a histogram, you can normalize the score difference by that histogram to make sure that, say, 30% of your subjects are accepted if you see a certain score difference. That normalization will allow you to create uniformly distributed probability percentages. See for example How to calculate the confidence intervals for likelihood ratios from a 2x2 table in the presence of cells with zeroes
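One way to implement that normalization is to map each score through the empirical CDF of scores collected on a calibration set; a sketch in Python (the calibration scores below are made up):

import numpy as np

# Scores collected on a large calibration dataset (placeholder values).
calib = np.sort(np.random.default_rng(0).normal(0.0, 2.0, 10000))

def score_to_percent(score):
    # Empirical CDF: the fraction of calibration scores below this score, as 0-100.
    return 100.0 * np.searchsorted(calib, score) / len(calib)

print(score_to_percent(0.0))   # ~50 for a score near the calibration median
print(score_to_percent(4.0))   # close to 100 in the upper tail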
This problem is rarely solved in speaker identification systems because confidence intervals are not what you actually want to display. You need a simple accept/reject decision, and for that you need to know the false reject and false accept rates. So it is enough to find just a threshold, not to build the whole distribution.

How do you calculate the standard deviation for data which is mainly discrete but has a probability of being continuous?

I’m having some issues with calculating the standard deviation of a game. In the game you can get several different discrete scores. Each score has a fixed probability, which is given. There is also a 5% chance that your score is randomly generated. You do not know the distribution of that random variable; you are only given its mean and variance.
I’ve calculated the variance of the main game (ignoring the random variable) to be 5.2. The variance of the random variable is 137. From this I get a standard deviation of
sqrt(5.2 + 0.05 * 137) = sqrt(12.05) ≈ 3.47
Is this the correct method?
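For comparison, the standard identity for the variance of a two-component mixture (the law of total variance) is, in the notation of the question:

Var(total) = 0.95 * Var(main) + 0.05 * Var(random) + 0.95 * 0.05 * (mean(main) - mean(random))^2

The posted formula matches this only if the main-game variance is weighted by its probability and the two component means are (approximately) equal; the component means are not given in the question, so the mean-difference term cannot be evaluated here.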
