statistical test for samples that follow normal distribution, with each sample having multiple measurements? - statistics

I have a set of sample (i = 1 : n), with each one measured for a specific metric 10 times.
The metric mean of the 10 measurements for each sample has a mean mu(i).
I've done dbscan clustering on all the mu, to find out the outlier samples. Now I want to test whether a given outlier is statistically different from the core samples.
The samples appear to follow normal distribution. For each sample, the 10 measurements also appear to follow normal distribution.
If I just use the mu(i) as the metric for each sample, I can easily calculate Z-score and p-value based on normal distribution. My question is, how do I make use of the 10 measurements for each sample to add to my statistical power (is it possible?)
Not very good at statistics, anything would help, thanks in advance...

Related

Which confidence interval is better?

I have 2 confidence intervals, one based on gamma distribution and the other based on normal distribution for a study. I wish to know which CI is used more often in which case.

Predictive Distribution of Time series with Uncertain Future Values

In machine Learning, and especially in Turning Point Detection Problem, it is important to have the best estimate for the probability distribution function (PDF) of the future samples. Lets say that we have ${x_1, \cdots, x_n}$ as a time series, probably a Gaussian one with $f_{X_1, \cdots, X_n}(x_1, \cdots, x_n)$ as its joint distribution function. We want to estimate the predictive distribution of the next $k$ uncertain samples, having the previous $n$ samples with or wouthout the ARMA/ARIMA assumptions, i.e., we are looking to find the $f_{X_{n+1}, \cdots, X_{n+k}}(x_{n+1}, \cdots, x_{n+k} | x_1, \cdots, x_n)$. With the Normal distribution assumption, what would be the mean and variance of the predictive distribution and how could we estimate them using known $n$ samples.

Verify transmit power to be within certain limits of its expected value over 95% of test measurements

I have a requirement where I have to verify the transmit power out of a device as measured at its connector is within 2 dB of its expected value over 95% of test measurements.
I am using a signal analyzer to analyze the transmitted power. I only get the average power value, min, max and stdDev of the measurements and not the individual power measurements.
Now, the question is how would I verify the "95% thing" using average power, min, max and stdDev. It seems that I can use normal distribution to find the 95% confidence level.
I would appreciate if someone can help me on this.
Thanks in anticipation
The way I'm reading this, it seems you are a statistical beginner, so if I'm wrong there, the rest of this answer will probably be insultingly basic, and I'm sorry.
Anyway, the idea is that if a dataset is normally distributed, and all the observations are independent of one another, then 95% of the data points will fall within 1.96 standard deviations of the mean.
Do you get identical estimates of average power every time you measure, or are there some slight random differences from reading to reading? My guess is that it's the second. If you were to measure the power a whole bunch of times, and each time you plotted your average power value on a histogram, then that histogram of sample means would have the shape of a bell curve. This bell curve of sample means would have its own mean and standard deviation, and if you have thousands or millions of data points going into the calculation of each average power reading, it's not horrible to assume that it is a normal distribution. The explanation for this phenomenon is known as the 'central limit theorem', and I recommend both the Khan academy's presentation of it as well as the wikipedia page on it.
On the other hand, if your average power is the mean of some small number of data points, like for instance n= 5, or n= 30, then assumption of a normal distribution of sample means can be pretty bad. In this case, your 95% confidence interval around the average power goes from qt(0.975,n-1)*SD/sqrt(n) below the average to qt(0.975,n-1)*SD/sqrt(N) above the average, where qt(0.975,n-1) is the 97.5th percentile of the t distribution with n-1 degrees of freedom, and SD is your measured standard deviation.

Check if numbers form bell curve (gauss distribution) Python 3

I've got files with irradiance data measured every minute 24 hours a day.
So if there is a day without any clouds on the sky the data shows a nice continuous bell curves.
When looking for a day without any clouds in the data I always plotted month after month with gnuplot and checked for nice bell curves.
I was wondering If there's a python way to check, if the Irradiance measurements form a continuos bell curve.
Don't know if the question is too vague but I'm simply looking for some ideas on that quest :-)
For a normal distribution, there are normality tests.
In short, we abuse some knowledge we have of what normal distributions look like to identify them.
The kurtosis of any normal distribution is 3. Compute the kurtosis of your data and it should be close to 3.
The skewness of a normal distribution is zero, so your data should have a skewness close to zero
More generally, you could compute a reference distribution and use a Bregman Divergence, to assess the difference (divergence) between the distributions. bin your data, create a histogram, and start with Jensen-Shannon divergence.
With the divergence approach, you can compare to an arbitrary distribution. You might record a thousand sunny days and check if the divergence between the sunny day and your measured day is below some threshold.
Just to complement the given answer with a code example: one can use a Kolmogorov-Smirnov test to obtain a measure for the "distance" between two distributions. SciPy offers a neat interface for this, called kstest:
from scipy import stats
import numpy as np
data = np.random.normal(size=100) # Our (synthetic) dataset
D, p = stats.kstest(data, "norm") # Perform a one-sided Kolmogorov-Smirnov test
In the above example, D denotes the distance between our data and a Gaussian normal (norm) distribution (smaller is better), and p denotes the corresponding p-value. Other distributions can be similarly tested by substituting norm with those implemented in scipy.stats.

A method to find the inconsistency or variation in the data

I am running an experiment (it's an image processing experiment) in which I have a set of paper samples and each sample has a set of lines. For each line in the paper sample, its strength is calculated which is denoted by say 's'. For a given paper sample I have to find the variation amongst the strength values 's'. If the variation is above a certain limit, we have to discard that paper.
1) I started with the Standard Deviation of the values, but the problem I am facing is that for each sample, order of magnitude for s (because of various properties of line like its length, sharpness, darkness etc) might differ and also the calculated Standard Deviations values are also differing a lot in magnitude. So I can't really use this method for different samples.
Is there any way where I can find that suitable limit which can be applicable for all samples.
I am thinking that since I don't have any history of how the strength value should behave,( for a given sample depending on the order of magnitude of the strength value more variation could be tolerated in that sample whereas because the magnitude is less in another sample, there should be less variation in that sample) I first need to find a way of baselining the variation in different samples. I don't know what approaches I could try to get started.
Please note that I have to tell variation between lines within a sample whereas the limit should be applicable for any good sample.
Please help me out.
You seem to have a set of samples. Then, for each sample you want to do two things: 1) compute a descriptive metric and 2) perform outlier detection. Both of these are vast subjects that require some knowledge of the phenomenology and statistics of the underlying problem. However, below are some ideas to get you going.
Compute a metric
Median Absolute Deviation. If your sample strength s has values that can jump by an order of magnitude across a sample then it is understandable that the standard deviation was not a good metric. The standard deviation is notoriously sensitive to outliers. So, try a more robust estimate of dispersion in your data. For example, the MAD estimate uses the median in the underlying computations which is more robust to a large spread in the numbers.
Robust measures of scale. Read up on other robust measures like the Interquartile range.
Perform outlier detection
Thresholding. This is similar to what you are already doing. However, you have to choose a suitable threshold for the metric computed above. You might consider using another robust metric for thresholding the metric. You can compute a robust estimate of their mean (e.g., the median) and a robust estimate of their standard deviation (e.g., 1.4826 * MAD). Then identify outliers as metric values above some number of robust standard deviations above the robust mean.
Histogram Another simple method is to histogram your computed metrics from step #1. This is non-parametric so it doesn't require you to model your data. If can histogram your metric values and then use the top 1% (or some other value) as your threshold limit.
Triangle Method A neat and simple heuristic for thresholding is the triangle method to perform binary classification of a skewed distribution.
Anomaly detection Read up on other outlier detection methods.

Resources