This data does look correlated right? - statistics

This data looks correlated, doesn't it? The Pearson coefficient is only 0.2; I assume the value is this low because the correlation is not linear. Thanks.

This looks like an exponential decrease, while your x-values are discrete. You could log-transform your y-values and jitter your x-values (adding random numbers of +/- 0.2 or something of this order of magnitude) and then recheck the correlation.
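The suggestion above can be tried out on synthetic data. The following is only a sketch: the decay rate, noise level, sample size, and the helper name `pearson` are all assumptions made for illustration, not values taken from the plot in the question. Jitter mostly helps when plotting overlapping points; its effect on the coefficient itself should be small.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient (hypothetical helper)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(0)

# Assumed data: discrete x in 1..10, y decaying exponentially with
# multiplicative noise. These parameters are made up for the example.
xs = [rng.randint(1, 10) for _ in range(500)]
ys = [math.exp(-0.5 * x + rng.gauss(0.0, 0.3)) for x in xs]

# Jitter the discrete x-values by +/- 0.2 and log-transform y.
x_jittered = [x + rng.uniform(-0.2, 0.2) for x in xs]
log_ys = [math.log(y) for y in ys]

r_raw = pearson(xs, ys)
r_log = pearson(x_jittered, log_ys)
print("raw Pearson:            ", round(r_raw, 3))
print("log-y, jittered-x Pearson:", round(r_log, 3))
```

If the relationship really is exponential, the coefficient on the log scale should be noticeably closer to -1 than the one on the raw scale.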

Related

Bollinger Bands versus statistics: is 1 standard deviation supposed to be split into two halves by its mean, or applied in full as top and bottom bands from the mean?

I have a question about how Bollinger Bands are plotted in relation to statistics. In statistics, once a standard deviation is calculated from the mean of a set of numbers, shouldn't 1 standard deviation be interpreted by dividing this number in half and plotting each half above and below the mean? By doing so, you can then determine whether or not the data points fall within this 1 standard deviation.
Then, correct me if I am wrong, but aren't Bollinger Bands NOT calculated this way? Instead, they take 1 standard deviation (if you have set it to 1) and plot the WHOLE value both above and below the mean (not splitting it in two), thereby doubling the size of this standard deviation?
Bollinger Bands loosely state that 68% of data falls within the first band, 1 standard deviation (loosely because the empirical rule in statistics requires normal distributions, which stock prices most often are not). However, if this empirical rule comes from statistics where 1 standard deviation is split in half, that means that applying a 68% probability to an entire Bollinger Band is wrong. Is this correct?
You can modify the deviation multiple to suit your purpose; you can use 0.5, for example.
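A quick check of what the multiple does, as a sketch on simulated normal data (the mean of 100, standard deviation of 5, and the helper name `fraction_within_band` are assumptions for illustration; real prices are usually not normal). The empirical rule's 68% refers to the interval from mean minus one full standard deviation to mean plus one full standard deviation, which is the same shape as a band drawn with a multiple of 1.

```python
import random
import statistics

def fraction_within_band(values, k):
    """Share of values inside [mean - k*sd, mean + k*sd] (hypothetical helper)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    lower, upper = mean - k * sd, mean + k * sd
    return sum(lower <= v <= upper for v in values) / len(values)

rng = random.Random(42)
# Assumed data: 10,000 draws from a normal distribution.
sample = [rng.gauss(100.0, 5.0) for _ in range(10_000)]

full_band = fraction_within_band(sample, 1.0)   # whole SD on each side
half_band = fraction_within_band(sample, 0.5)   # half an SD on each side
print("within mean +/- 1.0 SD:", round(full_band, 3))
print("within mean +/- 0.5 SD:", round(half_band, 3))
```

For normal data the first fraction comes out near 0.68 and the second near 0.38, so splitting one standard deviation in half around the mean would not give the 68% interval.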

90% confidence ellipsoid of 3-dimensional data

I got to know confidence ellipses at university (but that was some semesters ago).
In my current project, I'd like to calculate a 3-dimensional confidence ellipse/ellipsoid for which I can set the probability of success to e.g. 90%. The center of the data is shifted from zero.
At the moment I am calculating the variance-covariance matrix of the dataset and, from it, its eigenvalues and eigenvectors, which I then represent as an ellipsoid.
Here, however, I am missing the information on the probability of success, which I cannot specify.
What is the correct way to calculate a confidence ellipsoid with e.g. 90% probability of success?

Calculate approximate trimmed 10% mean using percentile data

I have access to my services' latency metrics at all percentiles. I need to calculate the trimmed 10% mean of the service's latency now. Is there a way I can approximate the trimmed 10% mean using just the percentiles data? I understand I can simply calculate the mean using a script for the transactions between the 10th percentile and 90th percentile, but since this data is to be used directionally only, I was wondering if there is an easy hack to guesstimate it as doing it at scale would be expensive.
This is really more suitable for stats.stackexchange.com, but anyway: you can approximate the trimmed mean, or any other sample statistic, given percentiles. From the percentiles, construct the equivalent histogram. Each bar spans from one percentile value to the next, and its mass (area) equals the difference between the two percentile ranks. (So if you reversed the process and added up the bars, you would get the percentiles again.)
Now, with that histogram, calculate the sample statistic. The exact value is an integral. An easy approximation is to generate a number of data points from the span of each bar, and then use those data to calculate the sample statistic according to the ordinary formula. The first thing to try is to just generate data equal to the midpoint of each bar, with the number of values in each bin proportional to the bar's mass.
I don't know of a package that does this, but with this description maybe you can look one up, or work out the details.
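The midpoint approximation described above could look roughly like the following. This is a sketch, not a tested production routine: the function name `approx_trimmed_mean`, the dictionary input format (percentile rank mapped to latency value), and the linear interpolation inside each bar are all assumptions made for the example.

```python
def approx_trimmed_mean(percentiles, lower=10.0, upper=90.0):
    """Approximate the trimmed mean between two percentile ranks.

    `percentiles` maps a percentile rank (0..100) to the observed value,
    e.g. {50: 120.0, 90: 300.0}. Each pair of neighbouring ranks forms one
    histogram bar whose mass is the difference in rank; the bar is
    represented by the midpoint of its (clipped) value range.
    """
    points = sorted(percentiles.items())
    weighted_sum = 0.0
    total_weight = 0.0
    for (r0, v0), (r1, v1) in zip(points, points[1:]):
        lo = max(r0, lower)
        hi = min(r1, upper)
        if hi <= lo:
            continue  # bar lies entirely in a trimmed tail
        # Linearly interpolate the values at the clipped rank boundaries.
        v_lo = v0 + (v1 - v0) * (lo - r0) / (r1 - r0)
        v_hi = v0 + (v1 - v0) * (hi - r0) / (r1 - r0)
        weight = hi - lo
        weighted_sum += weight * 0.5 * (v_lo + v_hi)
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("no percentile data inside the requested range")
    return weighted_sum / total_weight

# Made-up example: values equal to their percentile rank.
example = {rank: float(rank) for rank in range(0, 101, 10)}
estimate = approx_trimmed_mean(example)
print(estimate)
```

The finer the percentile grid, the closer the estimate should get to the true trimmed mean; with only a few percentiles (say p50, p90, p99) the result is directional at best.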

Normalisation or standardisation for detecting outliers?

When should I use min-max scaling (normalisation), and when should I use standardisation (the z-score), for data pre-processing?
I know that normalisation brings the range of a feature down to 0 to 1, and the z-score brings it down to -3 to 3, but I am unsure which of the two techniques to use for detecting outliers in data.
Let us briefly agree on the terms:
The z-score tells us how many standard deviations a given element of a sample is away from the mean.
Min-max scaling is the method of rescaling a range of measurements to the interval [0, 1].
By those definitions, the z-score usually spans an interval much larger than [-3, 3] if your data follows a long-tailed distribution. On the other hand, a plain normalization does indeed limit the range of the possible outcomes, but it will not help you find outliers, since it just bounds the data.
What you need for outlier detection are thresholds above or below which you consider a data point to be an outlier. Many programming languages offer violin plots or box plots, which nicely show your data distribution. The methods behind these plots implement a common choice of thresholds:
Box and whisker [of the box plot] plots quartiles, and the band inside the box is always the second quartile (the median). But the ends of the whiskers can represent several possible alternative values, among them:
the minimum and maximum of all of the data [...]
one standard deviation above and below the mean of the data
the 9th percentile and the 91st percentile
the 2nd percentile and the 98th percentile.
All data points outside the whiskers of the box plots are plotted as points and considered outliers.
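To make the contrast concrete, here is a small sketch with made-up numbers. The data, the cut-off of 3 standard deviations, and the variable names are assumptions for illustration; the appropriate threshold depends on your data and sample size.

```python
import statistics

# Made-up sample: 20 values near 10 and a single extreme value.
data = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
        11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 100]

mean = statistics.mean(data)
sd = statistics.stdev(data)

# Standardisation: distance from the mean in units of standard deviation.
z_scores = [(x - mean) / sd for x in data]

# Min-max scaling: squeezes everything into [0, 1], extreme value included.
lo, hi = min(data), max(data)
min_max = [(x - lo) / (hi - lo) for x in data]

# An explicit threshold is what actually flags outliers (3 is an assumption).
threshold = 3.0
outliers = [x for x, z in zip(data, z_scores) if abs(z) > threshold]

print("min-max range:", min(min_max), "to", max(min_max))
print("largest |z|:  ", round(max(abs(z) for z in z_scores), 2))
print("flagged:      ", outliers)
```

After min-max scaling the extreme value is simply mapped to 1.0, so the scaled range alone says nothing about it being unusual; the z-score combined with a threshold singles it out.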

Convert GMM-UBM scores to equivalent accuracy percent

I have constructed a GMM-UBM model for speaker recognition. The outputs of the models adapted for each speaker are scores calculated as log-likelihood ratios. Now I want to convert these likelihood scores to an equivalent number between 0 and 100. Can anybody guide me, please?
There is no straightforward formula. You can do simple things like
prob = exp(logratio_score)
but those might not reflect the true distribution of your data. The computed probability percentages of your samples will not be uniformly distributed.
Ideally, you need to take a large dataset and collect statistics on what acceptance/rejection rate you get for what score. Then, once you have built a histogram, you can normalize the score difference by that histogram to make sure that 30% of your subjects are accepted if you see a certain score difference. That normalization will allow you to create uniformly distributed probability percentages. See, for example, How to calculate the confidence intervals for likelihood ratios from a 2x2 table in the presence of cells with zeroes
This problem is rarely solved in speaker identification systems because confidence intervals are not what you actually want to display. You need a simple accept/reject decision, and for that you need to know the false reject and false accept rates. So it is enough to find just a threshold, not build the whole distribution.
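Both ideas, the empirical score-to-percentage mapping and the single threshold, can be sketched in a few lines. Everything here is hypothetical: the function names `score_to_percent` and `error_rates`, the accept rule (score at or above the threshold), and the tiny score lists, which stand in for the large labelled dataset mentioned above.

```python
from bisect import bisect_right

def score_to_percent(score, reference_scores):
    """Empirical percentile (0..100) of `score` among reference scores."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)

def error_rates(genuine_scores, impostor_scores, threshold):
    """False reject and false accept rates when accepting score >= threshold."""
    false_rejects = sum(s < threshold for s in genuine_scores)
    false_accepts = sum(s >= threshold for s in impostor_scores)
    return (false_rejects / len(genuine_scores),
            false_accepts / len(impostor_scores))

# Made-up log-likelihood-ratio scores from labelled trials.
genuine = [2.0, 1.5, 0.5, 3.0]      # target speaker trials
impostor = [-1.0, 0.0, 0.7, -2.0]   # non-target speaker trials

percent = score_to_percent(0.0, impostor)
frr, far = error_rates(genuine, impostor, threshold=0.6)
print("score 0.0 is above", percent, "% of impostor scores")
print("false reject rate:", frr, " false accept rate:", far)
```

Sweeping the threshold over the observed scores and picking the value with an acceptable trade-off between the two error rates is one simple way to settle on the accept/reject decision.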
