When you're sampling numbers whose precision is much higher than what is practical for your purposes, a naive mode implementation is useless (each sample might very well be unique).
For instance, consider sampling round-trip times across networked machines. The potential precision of a CPU clock is very high. If you only cared about precision down to 1 ms or so, and you sampled across a range of pings from Pmax to Pmin, what would be a robust way of measuring the "most common" ping among them?
A couple of possible solutions:
(1) construct a histogram, perhaps using automatically-chosen bins. Then report the bin which contains the most data.
(2) fit a parametric distribution to the data and report the mode of that distribution. The simplest example is to fit a Gaussian and report its mean (which equals the mode for a Gaussian). But there are other reasonable choices of distribution whose modes are computed from their fitted parameters, e.g., fit a gamma distribution and report its mode.
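For illustration, here is a minimal sketch of both ideas in Python (a sketch only: the fake gamma-distributed pings, the 1 ms bin width, and the choice of a gamma fit are all assumptions you would adapt to your data):

```python
import numpy as np
from scipy import stats

# Fake round-trip times (ms) for illustration only
rng = np.random.default_rng(0)
pings = rng.gamma(shape=2.0, scale=10.0, size=5000)

# (1) Histogram-based mode: bin at the precision you care about (here ~1 ms)
counts, edges = np.histogram(pings, bins=np.arange(pings.min(), pings.max() + 1.0, 1.0))
i = np.argmax(counts)
hist_mode = 0.5 * (edges[i] + edges[i + 1])  # center of the most populated bin

# (2) Parametric mode: fit a gamma distribution and report its mode
shape, loc, scale = stats.gamma.fit(pings)
gamma_mode = loc + (shape - 1.0) * scale if shape > 1.0 else loc

print(hist_mode, gamma_mode)
```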
Related
I have a list of numbers. Below are some basic statistics:
N > 1000
Max: 9.24
Min: 0.00955
Mean: 1.84932
Median: 0.97696
It seems that the data is right skewed, i.e. many small numbers and a few very large numbers.
I want to find a distribution to generalize these numbers. I think Normal distribution, Gamma distribution, and Laplace distribution all look possible. How do I determine which distribution is the best?
I have to say that I usually do it the same way you did: by plotting the data and looking at its shape.
To be more rigorous, and only for the normal distribution, I perform the Shapiro-Wilk test for normality, which at least tells me whether the null hypothesis was rejected; if it was not, then it was not possible to show that the data deviate from a normal distribution. Usually this is more than acceptable in scientific settings.
I know there exist equivalent tests for the Laplace and Gamma distributions, although they are still fairly new research, like this. Also, many sites offer the Shapiro-Wilk test online, like this one.
With all positive values and the mean being about double the median, your data are definitely skewed right. You can rule out both normal and Laplace because both are symmetric and can go negative.
Scope out some of the many fine alternatives at the Wikipedia distributions page. Make a histogram of your data and check it for similarities in shape to those distributions. Exponentials, log normals, chi-squares, and the gamma family could all give numeric results such as the ones you described, but without knowing anything about the variance/std deviation, whether your data are unimodal or multimodal, or where the mode(s) are, we can only make guesses about a very large pool of possibilities.
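If it helps, here is a rough sketch (assuming SciPy, and that the data sit in a 1-D array loaded from a hypothetical file) of fitting a few right-skewed candidates and comparing them by log-likelihood / AIC; the candidate list is only an example, not a recommendation:

```python
import numpy as np
from scipy import stats

x = np.loadtxt("values.txt")  # hypothetical file holding the ~1000 positive values

candidates = {
    "exponential": stats.expon,
    "log-normal": stats.lognorm,
    "gamma": stats.gamma,
    "chi-square": stats.chi2,
}

for name, dist in candidates.items():
    params = dist.fit(x)                      # maximum-likelihood fit
    loglik = np.sum(dist.logpdf(x, *params))  # log-likelihood at the fitted parameters
    aic = 2 * len(params) - 2 * loglik        # lower AIC suggests a better trade-off
    print(f"{name:12s}  AIC = {aic:.1f}")
```

A histogram or quantile plot against each fitted candidate is still worth checking alongside the AIC numbers, especially in the tails.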
I am running an experiment (it's an image processing experiment) in which I have a set of paper samples and each sample has a set of lines. For each line in the paper sample, its strength is calculated which is denoted by say 's'. For a given paper sample I have to find the variation amongst the strength values 's'. If the variation is above a certain limit, we have to discard that paper.
1) I started with the standard deviation of the values, but the problem I am facing is that the order of magnitude of s can differ from sample to sample (because of various properties of the line, like its length, sharpness, darkness, etc.), and the calculated standard deviations also differ a lot in magnitude. So I can't really use this method across different samples.
Is there any way to find a suitable limit that is applicable to all samples?
Since I don't have any history of how the strength value should behave (for one sample, a larger order of magnitude might mean more variation can be tolerated, whereas another sample with smaller magnitudes should tolerate less), I think I first need a way of baselining the variation across different samples. I don't know what approaches I could try to get started.
Please note that I have to assess the variation between lines within a sample, while the limit should be applicable to any good sample.
Please help me out.
You seem to have a set of samples. Then, for each sample you want to do two things: 1) compute a descriptive metric and 2) perform outlier detection. Both of these are vast subjects that require some knowledge of the phenomenology and statistics of the underlying problem. However, below are some ideas to get you going.
Compute a metric
Median Absolute Deviation. If your sample strength s has values that can jump by an order of magnitude across a sample then it is understandable that the standard deviation was not a good metric. The standard deviation is notoriously sensitive to outliers. So, try a more robust estimate of dispersion in your data. For example, the MAD estimate uses the median in the underlying computations which is more robust to a large spread in the numbers.
Robust measures of scale. Read up on other robust measures like the Interquartile range.
Perform outlier detection
Thresholding. This is similar to what you are already doing, but you have to choose a suitable threshold for the metric computed above, and it helps to make that threshold robust as well. Compute a robust estimate of the metric values' center (e.g., the median) and a robust estimate of their spread (e.g., 1.4826 * MAD). Then flag as outliers any metric values more than some number of robust standard deviations above that robust center (see the sketch after this list).
Histogram. Another simple method is to histogram your computed metrics from step #1. This is non-parametric, so it doesn't require you to model your data. You can histogram your metric values and then use the top 1% (or some other cut-off) as your threshold limit.
Triangle Method. A neat and simple heuristic for thresholding is the triangle method, which performs a binary classification of a skewed distribution.
Anomaly detection. Read up on other outlier detection methods.
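As a rough illustration of the MAD-plus-robust-thresholding idea (a sketch only: the fake metric values, the 1.4826 scaling, and the cut-off of 3 robust standard deviations are assumptions to tune to your data):

```python
import numpy as np

def mad(values):
    """Median absolute deviation: a robust estimate of spread."""
    med = np.median(values)
    return np.median(np.abs(values - med))

# Fake per-sample dispersion metrics for illustration, e.g. the MAD of the
# line strengths s within each paper sample (computed however suits your data)
rng = np.random.default_rng(0)
metrics = rng.gamma(2.0, 1.0, size=50)

center = np.median(metrics)        # robust "mean"
spread = 1.4826 * mad(metrics)     # robust "standard deviation" (normal-consistent MAD)

k = 3.0                            # how many robust SDs above the center counts as "bad"
is_outlier = metrics > center + k * spread
print(np.where(is_outlier)[0])     # indices of samples to discard
```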
I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data are weighted because each person who was surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS seems to think the young man's database entries each represent 28000 measurements when they actually represent just one, which makes SPSS think we have much more data than we actually do. As a result, SPSS is giving very, very low standard error estimates and very, very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in Statistics is treated as a frequency weight.
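As a rough illustration of what "scale weights to the actual sample size" means (a sketch in Python, not SPSS syntax; the weights and values are made up): divide each raw population weight by the mean weight so the weights sum to the number of respondents rather than to the population.

```python
import numpy as np

raw_weights = np.array([28000.0, 15000.0, 31000.0, 22000.0])  # made-up population weights
values = np.array([3.1, 2.4, 4.0, 3.3])                       # made-up survey responses

# Rescale so the weights sum to the actual sample size (mean weight becomes 1)
norm_weights = raw_weights / raw_weights.mean()

weighted_mean = np.average(values, weights=norm_weights)
print(norm_weights.sum(), weighted_mean)  # weights now sum to n; the weighted mean is unchanged
```

The weighted point estimates are unchanged by this rescaling; what changes is that frequency-weight procedures no longer think n is in the hundreds of thousands. The standard errors are still only approximate unless a design-based procedure such as Complex Samples is used.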
I have a catalog of 900 applications.
I need to determine how their reliability is distributed as a whole (e.g., is it normal?).
I can measure the reliability of an individual application.
How can I determine the reliability of the group as a whole without measuring each one?
That's a pretty open-ended question! Overall, distribution fitting can be quite challenging and works best with large samples (100's or even 1000's). It's generally better to pick a modeling distribution based on known characteristics of the process you're attempting to model than to try purely empirical fitting.
If you're going to go empirical, for a start you could take a random sample, measure the reliability scores (whatever you're using for that) of your sample, sort them, and plot them vs normal quantiles. If they fall along a relatively straight line the normal distribution is a plausible model, and you can estimate sample mean and variance to parameterize it. You can apply the same idea of plotting vs quantiles from other proposed distributions to see if they are plausible as well.
Watch out for behavior in the tails, in particular. Pretty much by definition the tails occur rarely and may be under-represented in your sample. Like all things statistical, the larger the sample size you can draw on the better your results will be.
I'd also add that my prior belief would be that a normal distribution wouldn't be a great fit. Your reliability scores probably fall on a bounded range and tend to cluster towards one side or the other of it. If they tend to the high end, I'd predict that they get lopped off at that end of the range and have a long tail to the low side, and vice versa if they tend to the low end.
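A minimal sketch of the quantile-plot check described above (assuming SciPy and matplotlib, and that the sampled reliability scores sit in a hypothetical file; the file name and the normal reference distribution are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical file: reliability scores for a random sample of the 900 apps
scores = np.loadtxt("reliability_sample.txt")

# Plot sorted sample values against theoretical normal quantiles;
# a roughly straight line suggests the normal model is plausible.
stats.probplot(scores, dist="norm", plot=plt)
plt.title("Normal quantile plot of sampled reliability scores")
plt.show()

# If it looks plausible, parameterize the normal with the sample estimates
mu, sigma = scores.mean(), scores.std(ddof=1)
print(mu, sigma)
```

The same probplot call accepts other SciPy distributions via the dist argument, so you can repeat the check for any alternative candidate.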
My knowledge of statistics is minuscule, sorry. I have a large volume of measured amplitudes. In the absence of a signal, the noise is assumed to have a normal distribution. When a signal is present with higher amplitude than the surrounding noise, the distribution becomes more heavily tailed on the positive side. I was thinking of using skewness to detect the signal, but the region of higher amplitude (cells in the volume) is rather small compared to the volume itself: on the order of hundreds of cells out of some thousands.
If the skewness is zero for a normal distribution, how can I extract those cells in my volume which contribute to the non-zero skewness? If, say, my skewness value is 0.5, is there a way to drop all cells and keep only those which raised the skewness value? Perhaps I sound unclear, but that just shows how little I understand of the topic.
Thanks in advance.
It seems to me that the problem might best be modeled as a mixture model: we have a Gaussian background
B ~ N(0, sigma)
and a signal, about which the poster has not specified a particular model.
If we can assume that the signal also takes the form of one Gaussian (or possibly a mixture of several), then Gaussian mixture modelling with the EM algorithm may be a good way to solve it (see Wikipedia).
A good paper in the context of segmentation is this one:
http://www.fil.ion.ucl.ac.uk/~karl/Unified%20segmentation.pdf
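Under that Gaussian-signal assumption, here is a minimal sketch with scikit-learn (the two-component choice, the fake data, and the 0.5 posterior cut-off are all assumptions; "amplitudes" stands in for the flattened cell values):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fake data for illustration: mostly zero-mean noise plus a small positive-amplitude signal
rng = np.random.default_rng(0)
amplitudes = np.concatenate([rng.normal(0.0, 1.0, 9000), rng.normal(4.0, 1.0, 300)])

# Fit a two-component Gaussian mixture with EM
gmm = GaussianMixture(n_components=2, random_state=0).fit(amplitudes.reshape(-1, 1))

# Treat the component with the larger mean as the "signal" component
signal_comp = int(np.argmax(gmm.means_.ravel()))
post = gmm.predict_proba(amplitudes.reshape(-1, 1))[:, signal_comp]
signal_cells = np.where(post > 0.5)[0]  # indices of cells attributed to the signal
print(len(signal_cells))
```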
If we cannot make such an assumption, I would use a robust regression method to estimate the parameters of the Gaussian noise, where the signal is treated as an outlier, e.g. Least trimmed squares (again see Wikipedia).
The outlier cells can then be found via (Bonferroni-corrected) hypothesis testing, as described e.g. in this paper:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2900857/
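If the Gaussian-signal assumption is too strong, here is a rough sketch of the robust-noise-plus-testing route (the MAD-based scale estimate stands in for a full least-trimmed-squares fit, the data are fake, and the 0.05 family-wise level is an assumption):

```python
import numpy as np
from scipy import stats

# Fake data for illustration: background noise plus a small positive-amplitude signal
rng = np.random.default_rng(1)
amplitudes = np.concatenate([rng.normal(0.0, 1.0, 9000), rng.normal(4.0, 1.0, 300)])

# Robust estimate of the noise parameters (signal cells treated as outliers)
mu_hat = np.median(amplitudes)
sigma_hat = 1.4826 * np.median(np.abs(amplitudes - mu_hat))  # normal-consistent MAD

# One-sided test per cell: is this amplitude too large to be noise?
z = (amplitudes - mu_hat) / sigma_hat
p = stats.norm.sf(z)  # upper-tail p-values

# Bonferroni correction for the number of cells
alpha = 0.05 / amplitudes.size
signal_cells = np.where(p < alpha)[0]
print(len(signal_cells))
```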