90% Confidence ellipsoid of 3 dimensional data - statistics

I learned about confidence ellipses at university, but that was a few semesters ago.
In my current project I'd like to compute a three-dimensional confidence ellipse/ellipsoid for which I can set the coverage probability to e.g. 90%. The center of the data is shifted away from zero.
At the moment I compute the variance-covariance matrix of the dataset and, from it, its eigenvalues and eigenvectors, which I then render as an ellipsoid.
What I am missing, however, is where the coverage probability enters; I cannot specify it anywhere.
What is the correct way to calculate a confidence ellipsoid with e.g. 90% coverage probability?
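For reference, a minimal sketch of the usual approach, assuming the data is approximately multivariate normal: the sample covariance gives the ellipsoid's shape, and the chi-square quantile for the chosen coverage probability scales the axes. The array X and all numbers below are made-up placeholders.

import numpy as np
from scipy.stats import chi2

# placeholder data: 500 points of 3-D Gaussian data with a non-zero center
X = np.random.default_rng(0).multivariate_normal(
    mean=[1.0, 2.0, 3.0],
    cov=[[2.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 0.5]],
    size=500,
)

center = X.mean(axis=0)                 # ellipsoid center (shifted from zero)
cov = np.cov(X, rowvar=False)           # variance-covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # principal directions of the ellipsoid

# The missing ingredient: a point x lies inside the 90% ellipsoid iff
# (x - center)^T cov^{-1} (x - center) <= chi2.ppf(0.90, df=3)  (about 6.25).
scale = chi2.ppf(0.90, df=3)
semi_axes = np.sqrt(eigvals * scale)    # semi-axis lengths along the eigenvectors

print("center:", center)
print("semi-axes:", semi_axes)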

Related

Gaussian Mixture model log-likelihood to likelihood - Sklearn

I want to calculate the likelihoods instead of the log-likelihoods. I know that score gives the per-sample average log-likelihood, so I need to multiply score by the sample size, but the resulting log-likelihoods are very large negative numbers such as -38567258.1157, and when I take np.exp(scores) I get zero. Any help is appreciated.
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(covariance_type="diag", n_components=2)
y_pred = gmm.fit_predict(X_test)   # fit the mixture and assign each sample to a component
scores = gmm.score(X_test)         # average per-sample log-likelihood
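A self-contained sketch of one way to sidestep the underflow: keep the per-sample log-likelihoods from score_samples and only exponentiate per sample, rather than exponentiating the huge total. X_test below is a made-up placeholder.

import numpy as np
from sklearn.mixture import GaussianMixture

X_test = np.random.default_rng(0).normal(size=(1000, 5))   # placeholder data
gmm = GaussianMixture(covariance_type="diag", n_components=2).fit(X_test)

per_sample_loglik = gmm.score_samples(X_test)   # shape (1000,), one log-likelihood per sample
total_loglik = per_sample_loglik.sum()          # equals gmm.score(X_test) * len(X_test)
per_sample_lik = np.exp(per_sample_loglik)      # exponentiating per sample rarely underflows
print(total_loglik, per_sample_lik[:3])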

How does probability come in to play in a kNN algorithm?

kNN seems relatively simple to understand: you have your data points and you draw them in your feature space (in a feature space of dimension 2, it's the same as drawing points on an xy-plane graph). When you want to classify a new data point, you put it into the same feature space, find the nearest k neighbors, and see what their labels are, ultimately taking the label(s) with the most votes.
So where does probability come into play here? All I am doing is calculating the distance between two points and taking the label(s) of the closest neighbors.
For a new test sample you look at the K nearest neighbors and at their labels.
You count how many of those K samples fall into each class, and divide the counts by K.
For example, let's say that you have 2 classes in your classifier and you use K=3 nearest neighbors, and the labels of those 3 nearest samples are (0, 1, 1): the probability of class 0 is 1/3 and the probability of class 1 is 2/3.
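A minimal sketch of the same vote counting with scikit-learn's KNeighborsClassifier; the toy data is made up, and predict_proba returns exactly these vote fractions.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0], [0.2], [1.0], [1.2], [2.0]])
y_train = np.array([0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# The 3 nearest neighbors of 0.9 have labels (1, 1, 0), so the
# vote fractions are [1/3, 2/3] for classes 0 and 1.
print(knn.predict_proba([[0.9]]))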

Unsupervised Outlier detection

I have 6 points in each row and around 20k such rows. The points in each row are actually points on a curve, and the curve has the same general shape in every row (say a sigmoid curve or a straight line, etc.). The 6 points may have different x-values in each row. I also know a point (a, b) for each row which that curve should pass through. How should I go about finding the rows which may be anomalous or show unexpected behaviour compared to the other rows? I was thinking of curve fitting, but I only have 6 points per curve; all I know is that the majority of rows have the same kind of curve, so I can perhaps fit a general curve for all the rows and use a distance threshold for outlier detection.
What happens if you just treat the 6 points as a 12-dimensional vector and run any of the usual outlier detection methods such as LOF and LoOP?
It's easy to see the relationship between the Euclidean distance on the 12-dimensional vectors and the Euclidean distances between the 6 corresponding points, so this will compare the similarity of the curves.
You can of course also define a more complex distance function for LOF.
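A rough sketch of that idea with scikit-learn's LocalOutlierFactor; the rows array is a made-up placeholder for the real 20k x 12 matrix of stacked (x, y) coordinates.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
rows = rng.normal(size=(20000, 12))        # placeholder: 6 (x, y) points flattened per row

lof = LocalOutlierFactor(n_neighbors=20)   # Euclidean distance on the 12-D vectors by default
labels = lof.fit_predict(rows)             # -1 marks the suspected outliers, 1 the inliers
outlier_rows = np.where(labels == -1)[0]
print(len(outlier_rows), "suspected outlier rows")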

Convert GMM-UBM scores to equivalent accuracy percent

I have constructed a GMM-UBM model for speaker recognition. The models adapted for each speaker output scores calculated as log likelihood ratios. Now I want to convert these likelihood scores to an equivalent number between 0 and 100. Can anybody guide me please?
There is no straightforward formula. You can do simple things like
prob = exp(logratio_score)
but those might not reflect the true distribution of your data. The computed probability percentage of your samples will not be uniformly distributed.
Ideally you need to take a large dataset and collect statistics on which acceptance/rejection rate you get for which score. Once you have built a histogram you can normalize the score difference against it, so that, say, 30% of your subjects are accepted when you see a certain score difference. That normalization will give you uniformly distributed probability percentages. See for example How to calculate the confidence intervals for likelihood ratios from a 2x2 table in the presence of cells with zeroes
This problem is rarely solved in speaker identification systems because a confidence interval is not what you actually want to display. You need a simple accept/reject decision, and for that you need to know the false-reject and false-accept rates. So it is enough to find a threshold, not build the whole distribution.
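A rough sketch of the histogram/empirical-percentile idea; calibration_scores and new_score are made-up placeholders for log-likelihood-ratio scores collected on a held-out calibration set.

import numpy as np

calibration_scores = np.random.default_rng(0).normal(loc=0.0, scale=2.0, size=10000)  # placeholder
new_score = 1.5                                                                        # placeholder

# Percentile of the new score within the calibration distribution: the fraction
# of calibration trials that scored below it, expressed between 0 and 100.
percent = 100.0 * np.mean(calibration_scores < new_score)
print(round(percent, 1))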

How do you calculate the standard deviation for data which is mainly discrete but has a probability of being continuous?

I'm having some issues calculating the standard deviation of a game. In the game you can get several different discrete scores, and each score has a fixed, given probability. There is also a 5% chance that your score is randomly generated. You do not know the distribution of that random variable; you are only given its mean and variance.
I’ve calculated the variance of the main game (ignoring the random variable) to be 5.2. The variance of the random variable is 137. From this I get a standard deviation of
sqrt(5.2 + 0.05 * 137) ≈ 3.47
Is this the correct method?
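For comparison, a sketch of how a two-component mixture's variance is usually combined via the law of total variance. Only the variances (5.2 and 137) and the 5% chance come from the question; the component means below are hypothetical placeholders.

p_random = 0.05
var_main, var_random = 5.2, 137.0
mean_main, mean_random = 10.0, 10.0   # assumed placeholders; not given in the question

mean_total = (1 - p_random) * mean_main + p_random * mean_random
var_total = (
    (1 - p_random) * var_main + p_random * var_random       # expected within-component variance
    + (1 - p_random) * (mean_main - mean_total) ** 2
    + p_random * (mean_random - mean_total) ** 2             # spread of the component means
)
print(var_total ** 0.5)   # standard deviation of the mixture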
