Importance of normality for process capability when far from spec - outliers - statistics

I have a dataset (n=15) from product testing that contains an outlier (force = 2.7 lb), which causes the data to fail normality testing (p=0.012) and, as I understand it, prevents the use of capability analysis. The mean and standard deviation of the dataset including the outlier are 4.97 lb and 0.77 lb, respectively; without the outlier they are 5.14 lb and 0.46 lb. The spec is that the force must be greater than 1 lb. Clearly the product is well above the spec, but since I do not have normality due to the outlier, I don't know what statistical test I can use to demonstrate that the process is capable. Any thoughts?
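For a rough sense of the margin, a minimal sketch (Python) of the one-sided capability index implied by the numbers above; the 1 lb lower spec limit and the two mean/SD pairs are taken from the question, and the index only carries its usual interpretation if the data are at least approximately normal:

lsl = 1.0                                      # lower spec limit, lb

mean_all, sd_all = 4.97, 0.77                  # including the outlier
ppk_all = (mean_all - lsl) / (3 * sd_all)      # ~1.72

mean_trim, sd_trim = 5.14, 0.46                # excluding the outlier
ppk_trim = (mean_trim - lsl) / (3 * sd_trim)   # ~3.0

print(ppk_all, ppk_trim)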

Related

Can I use Pearson's correlation with non-normally distributed variables?

I have 4 variables I want to create a correlation matrix with. The problem is that two of these variables are non-normally distributed, which violates one of the assumptions of Pearson's correlation. If I run the correlation matrix anyway (treating them as normally distributed when they aren't), Pearson's correlation is significant at the 0.08 level across all correlations. Can I use the Pearson correlation anyway, given that I have a 0.08 level of significance?
I tried using Spearman's correlation, which doesn't assume a normal distribution, but the significance level of two of the six correlations was over 0.4 (thus not usable for research).
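For reference, a minimal sketch (Python/scipy) of computing both coefficients and their p-values on made-up x and y standing in for two of the variables; scipy's pearsonr and spearmanr each return the coefficient and its p-value:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(size=30)              # hypothetical skewed (non-normal) variable
y = 0.5 * x + rng.normal(size=30)         # hypothetical second variable

r_pearson, p_pearson = stats.pearsonr(x, y)     # linear association
r_spearman, p_spearman = stats.spearmanr(x, y)  # rank-based (monotonic) association
print(r_pearson, p_pearson)
print(r_spearman, p_spearman)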

How can MAPE be calculated if some of the actuals in the dataset are 0 values?

I am new to data science and am trying to understand how to evaluate the difference between forecasts and actuals.
Let's say I have actuals:
27.580
25.950
0.000 (Sum = 53.53)
And my predicted values using XGboost are:
29.9
25.4
15.0 (Sum = 70.3)
Is it better to just evaluate based on the sum, for example all predicted minus all actuals: difference = 70.3 - 53.53?
Or is it better to evaluate the difference using forecasting error measures like MSE, MAE, RMSE, or MAPE?
Since I read that MAPE is the most widely accepted, how can it be implemented in cases where the denominator is 0, as can be seen in my actuals above?
Is there a better way to evaluate deviation from actuals, or are these the only legitimate methods? My objective is to build more predictive models involving different variables, which will give me different predicted values, and then choose the one with the least deviation from the actuals.
Whether you should evaluate each point or the sum depends on your data and your use case.
For example, if each point represents a time bucket, and the accuracy of each time bucket is important (for example for a production plan), then I would say you need to evaluate each bucket.
If you are only going to measure the accuracy of the sum, then you might as well forecast the sum directly.
For your question on MAPE, there is no way around the issue you mention: your data need to be non-zero for MAPE to be meaningful. If you are only assessing one time series, you can use the MAE instead, and then you do not have the issue of the accuracy being infinite/undefined.
But there are many ways to measure accuracy, and in my experience, which one is preferable depends very much on your use case and your data set. See Hyndman's article on accuracy measures for intermittent demand for some good points.
I use MdAPE (Median Absolute Percentage Error) whenever MAPE is not possible to calculate due to 0s
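To make this concrete, a small sketch (Python/numpy) using the actuals and predictions quoted in the question: the MAPE term for the zero actual is infinite, while MAE stays defined, and MdAPE sidesteps the problem as long as fewer than half of the actuals are zero.

import numpy as np

actuals   = np.array([27.58, 25.95, 0.0])
predicted = np.array([29.9, 25.4, 15.0])

abs_err = np.abs(actuals - predicted)
mae = abs_err.mean()                          # (2.32 + 0.55 + 15.0) / 3, always defined

with np.errstate(divide='ignore'):
    ape = abs_err / np.abs(actuals)           # last term divides by 0 -> inf
mape  = ape.mean()                            # inf, so MAPE is unusable here
mdape = np.median(ape)                        # median ignores the single infinite term

print(mae, mape, mdape)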

Statistical test for samples that follow a normal distribution, with each sample having multiple measurements?

I have a set of samples (i = 1:n), each measured for a specific metric 10 times.
The 10 measurements for each sample have a mean mu(i).
I've run DBSCAN clustering on all the mu(i) to find the outlier samples. Now I want to test whether a given outlier is statistically different from the core samples.
The samples appear to follow a normal distribution. For each sample, the 10 measurements also appear to follow a normal distribution.
If I just use mu(i) as the metric for each sample, I can easily calculate a Z-score and p-value based on the normal distribution. My question is: how do I make use of the 10 measurements for each sample to add to my statistical power (is it possible)?
Not very good at statistics, anything would help, thanks in advance...
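For what it's worth, a sketch in Python (with simulated data standing in for the real measurements) of the Z-score-on-means approach described in the question, plus the per-sample standard error that the 10 replicates provide as one starting point for using them; a hierarchical/mixed model would be the principled way to combine the within-sample and between-sample variation.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
measurements = rng.normal(10.0, 1.0, size=(30, 10))   # hypothetical: 30 samples x 10 replicates
mu = measurements.mean(axis=1)                         # per-sample means mu(i)

# z-score of each sample mean against the spread of all sample means
z = (mu - mu.mean()) / mu.std(ddof=1)
p = 2 * stats.norm.sf(np.abs(z))                       # two-sided p-values

# the 10 replicates also give each mean a standard error, i.e. how precisely mu(i) is known
mu_se = measurements.std(axis=1, ddof=1) / np.sqrt(measurements.shape[1])

print(z[:3], p[:3], mu_se[:3])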

How to identify data points that are significantly smaller than the others in a data set?

I have an array of real-valued data points. I wish to identify those data points whose values are significantly smaller than the others. Are there any well-known algorithms?
For example, the data set could be {0.01, 0.32, 0.45, 0.68, 0.87, 0.95, 1.0}. I can tell by inspection that 0.01 is significantly smaller than the others. However, I would like to know whether there is an established statistical method for this purpose. I tried outlier detection on my data set, but it could not find any outliers (such as flagging 0.01 as an outlier).
I have deleted a segment I wrote explaining the use of z-scores for your problem because it was incorrect. I hope the information below is accurate; just in case, use it as a guide only...
The idea is to build a z-distribution from the scores you are testing, minus the test score, and then use that distribution to get a z-score for the test score. Any |z| greater than 1.96 is unlikely to belong to your test population.
I am not sure this works properly, because you remove the test score's influence from the distribution; extreme scores will therefore have inflated z-scores, since they would otherwise contribute a greater variance (the denominator in the z-score equation).
This could be a start until someone with a modicum of expertise sets us straight :)
e.g.
for i = 1:length(data_set)
    test_score = data_set(i);
    sample_pop = data_set([1:i-1, i+1:end]);               % all points except the i-th
    sample_mean = mean(sample_pop);
    sample_stdev = std(sample_pop);
    test_z(i) = (test_score - sample_mean) / sample_stdev; % leave-one-out z-score
end
This can be done for higher dimensions by using the dim input for mean.
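As a quick check, the same leave-one-out idea in Python, applied to the example data from the question; only the smallest value ends up with |z| beyond 1.96 (it comes out around -2.5).

import numpy as np

data = np.array([0.01, 0.32, 0.45, 0.68, 0.87, 0.95, 1.0])

# compare each point to the mean/std of the remaining points
z = np.array([(x - np.delete(data, i).mean()) / np.delete(data, i).std(ddof=1)
              for i, x in enumerate(data)])
print(np.round(z, 2))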

How can I weight features for better clustering with a very small data set?

I'm working on a program that takes in several (<50) high-dimensional points in feature space (1000+ dimensions) and performs hierarchical clustering on them by recursively applying standard k-clustering.
My problem is that in any one k-clustering pass, different parts of the high-dimensional representation are redundant. I know this problem falls under the umbrella of feature extraction, selection, or weighting.
In general, what does one take into account when selecting a particular feature extraction/selection/weighting algorithm? And specifically, what algorithm would be the best way to prepare my data for clustering in my situation?
Check out this paper:
Witten DM and R Tibshirani (2010) A framework for feature selection in clustering. Journal of the American Statistical Association 105(490): 713-726.
And the related paper COSA by Friedman. They both discuss these issues in depth.
I would suggest a combination of PCA-based feature selection and k-means.
Find your principal components and order them by weight, then consume those weights at each depth of your hierarchy.
For example, let's assume you have a cluster hierarchy of four depths and you obtain component weights like this:
W1: 0.32
W2: 0.20
W3: 0.18
W4: 0.09
...
W1000: 0.00
We want to consume a weight of 1/N from the top at each depth, where N is the depth count. Taking N = 4 here, 0.25 of the first component's weight gets consumed and we reach:
W1: 0.07*
W2: 0.20
W3: 0.18
W4: 0.09
...
W1000: 0.00
The new score for the first component becomes 0.32 - 0.25 = 0.07. In the second iteration, we consume the top 0.25 again.
W1: 0.00*
W2: 0.02*
W3: 0.18
W4: 0.09
...
W1000: 0.00
The third iteration is:
W1: 0.00
W2: 0.00*
W3: 0.00*
W4: 0.04*
...
W1000: 0.00
And the fourth iteration uses the rest, whose weights sum up to 0.25.
At each iteration we use only the components whose weight we consume. For example, in the second iteration we only use PC1 and PC2 (of the features after the KLT), since those are the only components whose weights we consume. Thus, the components to cluster with at each iteration become:
Iteration 1: PC1
Iteration 2: PC1, PC2
Iteration 3: PC2, PC3, PC4
Iteration 4: PC4, ... PC1000
You may target a total weight consumption of less than 1.0 and consume less weight at each iteration accordingly. This is effectively the same as filtering out all components beyond your target weight, i.e. dimensionality reduction prior to clustering.
Finally, I don't know if there is a name for this approach. It just feels natural to use PCA for unsupervised problems. You may also try supervised feature selection after the first iteration, since you have cluster labels at hand.
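For what it's worth, a rough sketch (Python/scikit-learn, on made-up data) of just the component-selection schedule described above, taking the weights to be PCA's explained-variance ratios; the recursive splitting within each cluster is left out, and each depth uses the components whose cumulative-weight interval overlaps its 1/N slice.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))                 # hypothetical: <50 points, 1000+ dimensions

pca = PCA().fit(X)
scores = pca.transform(X)                       # data expressed in PC coordinates
weights = pca.explained_variance_ratio_         # the W1, W2, ... above

n_depths = 4
cum = np.cumsum(weights)
starts = np.concatenate(([0.0], cum[:-1]))
edges = np.linspace(0.0, 1.0, n_depths + 1)     # consume 1/N of the total weight per depth

for depth in range(n_depths):
    lo, hi = edges[depth], edges[depth + 1]
    use = np.where((starts < hi) & (cum > lo))[0]   # components overlapping this depth's slice
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores[:, use])
    print(depth, use.min(), use.max(), labels[:5])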

Resources