Generating multiple sets of n samples from a data set where the standard deviation of each set is minimized - statistics

I prepared a dataset and later learned that it is skewed.
Assume a plot of user_count vs score, where user_count is the number of users with that particular score.
I have to split the total users into multiple samples of size 100 <= n <= 1000 in such a way that the standard deviation of each created sample is minimized.
How do I do that?
I have tried binning methods such as custom binning, quantile binning, etc., but that has not helped me: with manual binning, some of my bins still have a high SD.
Example:
I created 19 custom bins with intervals .05-.10, .10-.15, ..., .90-.95, >.95.
This gives me per-bin statistics (table not shown here); the problem is that q19 has a high SD.
So I am trying to figure out a way to automatically create an optimal number of bins with minimal standard deviation in each.
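One possible direction (a sketch only, not from the original question): treat this as 1-D clustering, since k-means minimizes within-cluster variance, and pick the smallest number of bins whose worst bin stays under an acceptable SD. The data, the SD limit, and the range of k below are hypothetical placeholders, and the 100 <= n <= 1000 size constraint is not handled here and would need a post-processing step (merging or splitting clusters):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
scores = rng.beta(2, 8, size=5000)             # hypothetical skewed score data

SD_LIMIT = 0.02                                # hypothetical acceptable within-bin SD

for k in range(2, 40):                         # smallest k whose worst bin is acceptable
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
        scores.reshape(-1, 1)
    )
    worst_sd = max(scores[labels == c].std() for c in range(k))
    if worst_sd <= SD_LIMIT:
        print(f"k={k}, worst within-bin SD={worst_sd:.4f}")
        print("bin sizes:", np.bincount(labels))
        break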

Related

Identify short signal peaks with rolling time frame in python

I am trying to identify all peaks in my sensor readings data. The smallest peak can be less than 10 in amplitude and the largest can be more than 400. The rolling time window is not fixed, as one peak can arrive within 6 hours while the next arrives within another 3 hours. I tried wavelet transforms and Python peak identification, but those only work for the higher peaks. How do I resolve this? Here is the signal image link: the peaks I am trying to identify are in grey, and my algorithm's output is in blue.
Welcome to SO.
It is hard to provide a detailed answer without knowing your data's sampling rate and the duration of the peaks. From what I see in your example image, they seem all over the place!
I don't think that wavelets will be of any use for your problem.
A recipe that I like to use to despike data is:
Smooth your input data using a median filter (an 11-point median filter generally does the trick for me): smoothed=scipy.signal.medfilt(data, kernel_size=11)
Compute a noise array by subtracting smoothed from data: noise=data-smoothed
Create a despiked_data array from data:
despiked_data=np.zeros_like(data)
np.copyto(despiked_data, data)
Then, every time the noise exceeds a user-defined threshold (mythreshold), replace the corresponding values in despiked_data with NaN: despiked_data[np.abs(noise)>mythreshold]=np.nan
You may later interpolate the output despiked_data array but if your intent is simply to identify the spikes, you don't even need to run this extra step.
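Putting the recipe together as a runnable sketch (the synthetic signal and the threshold choice are placeholders, not part of the original answer):

import numpy as np
from scipy.signal import medfilt

rng = np.random.default_rng(0)
# hypothetical signal: slow baseline plus noise plus a few large spikes
data = np.sin(np.linspace(0, 20, 2000)) + rng.normal(0, 0.05, 2000)
data[rng.integers(0, 2000, 15)] += rng.uniform(5, 40, 15)

# 1) smooth with an 11-point median filter
smoothed = medfilt(data, kernel_size=11)

# 2) estimate the noise as the residual
noise = data - smoothed

# 3) copy the data and blank out samples whose residual exceeds the threshold
mythreshold = 5 * np.median(np.abs(noise))     # placeholder threshold choice
despiked_data = data.astype(float).copy()      # float so it can hold NaN
despiked_data[np.abs(noise) > mythreshold] = np.nan

# the NaN locations are the detected spikes
print(np.flatnonzero(np.isnan(despiked_data)))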

Maximum log-likelihood from data histogram not data directly

I have a complicated theoretical probability density function (PDF) that I define in Mathematica and that depends on some parameters I need to estimate by comparison with real data. From a big simulation done on a cluster (not my laptop) I have acquired a lot of events (over 10^9).
The way I understand things, given that I know the PDF, I 'just' need to sum the log-probabilities of those events for a given set of parameters and maximise this quantity by adjusting the parameters.
However, given the number of events, I would rather work with something less computationally expensive, for example something easily generated like a histogram of my data. But then how would my log-likelihood estimator work?
Thanks a lot for your answers!
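Not an answer from the original thread, but the standard trick here is the binned (multinomial) log-likelihood: with histogram counts n_i, and p_i the probability mass your PDF assigns to bin i (its integral over the bin), you maximise sum_i n_i * log p_i instead of looping over 10^9 events. A minimal numerical sketch with a simple stand-in PDF (the real PDF, bin count, and starting values would of course differ):

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# hypothetical data: 10^6 events drawn from a normal PDF with unknown parameters
rng = np.random.default_rng(0)
events = rng.normal(1.3, 0.7, size=1_000_000)

# build the histogram once; only the counts and bin edges are needed afterwards
counts, edges = np.histogram(events, bins=200)

def neg_binned_loglike(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    # probability mass the model assigns to each bin (CDF differences)
    p = np.diff(norm.cdf(edges, loc=mu, scale=sigma))
    p = np.clip(p, 1e-300, None)               # avoid log(0) in empty tails
    # binned log-likelihood: sum_i n_i * log p_i (negated for the minimizer)
    return -np.sum(counts * np.log(p))

result = minimize(neg_binned_loglike, x0=[0.0, 1.0], method="Nelder-Mead")
print(result.x)                                 # should be close to (1.3, 0.7)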

How do I calculate confidence interval with only sample size and confidence level

I'm writing a program that lets users run simulations on a subset of data, and as part of this process the program allows a user to specify the sample size they want based on a confidence level and confidence interval. Assuming a p value of .5 to maximize the sample size, and given that I know the population size, I can calculate the sample size. For example, if I have:
Population = 54213
Confidence Level = .95
Confidence Interval = 8
I get Sample Size 150. I use the formula outlined here:
https://www.surveysystem.com/sample-size-formula.htm
What I have been asked to do is reverse the process, so that the confidence interval is calculated from a given sample size and confidence level (and I know the population size). I'm having a horrible time trying to reverse this equation and was wondering if there is a formula. More importantly, does this seem like a sensible thing to do? It seems like a weird request to me.
I should mention (just to be clear) that the CI is estimated for the mean, not the population. In that case, if we assume the population is normally distributed and that we know the population standard deviation SD, then the CI is estimated as mean ± z * SD / sqrt(n).
From this formula you can also derive your sample-size formula by solving for n.
If the population SD is not known then you need to replace the z-value with a t-value.
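A sketch of that reversal in code (assuming the proportion-based formula from the surveysystem.com page, with p = .5 and a finite-population correction; the function name and the percentage scaling are my own):

from math import sqrt
from scipy.stats import norm

def confidence_interval(sample_size, confidence_level, population, p=0.5):
    # two-sided z-value for the confidence level, e.g. 1.96 for 95%
    z = norm.ppf(1 - (1 - confidence_level) / 2)
    # margin of error for a proportion, before any population correction
    moe = z * sqrt(p * (1 - p) / sample_size)
    # finite-population correction for sampling without replacement
    fpc = sqrt((population - sample_size) / (population - 1))
    return moe * fpc * 100                      # as a percentage, like the "8" above

print(confidence_interval(150, 0.95, 54213))    # roughly 8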

A method to find the inconsistency or variation in the data

I am running an image-processing experiment in which I have a set of paper samples, and each sample has a set of lines. For each line in a paper sample, its strength, denoted by 's', is calculated. For a given paper sample I have to find the variation among the strength values 's'. If the variation is above a certain limit, we have to discard that paper.
1) I started with the standard deviation of the values, but the problem I am facing is that the order of magnitude of s differs from sample to sample (because of properties of the line such as its length, sharpness, darkness, etc.), and so the calculated standard deviations also differ a lot in magnitude. So I can't really use this method across different samples.
Is there any way I can find a suitable limit that is applicable to all samples?
Since I don't have any history of how the strength values should behave (in a sample where the strength values have a larger order of magnitude, more variation could be tolerated, whereas in a sample where the magnitude is smaller, there should be less variation), I think I first need a way of baselining the variation across samples. I don't know what approaches I could try to get started.
Please note that I have to measure the variation between lines within a sample, whereas the limit should be applicable to any good sample.
Please help me out.
You seem to have a set of samples. Then, for each sample you want to do two things: 1) compute a descriptive metric and 2) perform outlier detection. Both of these are vast subjects that require some knowledge of the phenomenology and statistics of the underlying problem. However, below are some ideas to get you going.
Compute a metric
Median Absolute Deviation. If your sample strength s has values that can jump by an order of magnitude across a sample, then it is understandable that the standard deviation was not a good metric: the standard deviation is notoriously sensitive to outliers. So try a more robust estimate of dispersion in your data. For example, the MAD estimate uses the median in the underlying computations, which is more robust to a large spread in the numbers.
Robust measures of scale. Read up on other robust measures like the Interquartile range.
Perform outlier detection
Thresholding. This is similar to what you are already doing, but you have to choose a suitable threshold for the metric computed above, and robust statistics can help there too. Compute a robust estimate of the metrics' mean (e.g., the median) and a robust estimate of their standard deviation (e.g., 1.4826 * MAD), then identify as outliers any metric values more than some number of robust standard deviations above the robust mean (see the sketch after this list).
Histogram. Another simple method is to histogram your computed metrics from step #1. This is non-parametric, so it doesn't require you to model your data. You can histogram your metric values and then use the top 1% (or some other value) as your threshold limit.
Triangle method. A neat and simple heuristic for thresholding is the triangle method, which performs a binary classification of a skewed distribution.
Anomaly detection. Read up on other outlier detection methods.
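A minimal sketch of the MAD-based thresholding (the synthetic strengths and the 3-sigma cut-off are placeholders for your real per-sample metrics and limit):

import numpy as np

def mad(x):
    # Median Absolute Deviation: a robust estimate of spread
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

rng = np.random.default_rng(1)
# hypothetical stand-in: a dispersion metric (here, the MAD of the line
# strengths s) computed for each of 30 paper samples
per_sample_metric = np.array([mad(rng.lognormal(m, 0.3, 50))
                              for m in rng.uniform(0, 3, 30)])

robust_mean = np.median(per_sample_metric)
robust_sd = 1.4826 * mad(per_sample_metric)    # consistent with the SD for normal data
n_sigmas = 3.0                                 # placeholder cut-off

bad_samples = per_sample_metric > robust_mean + n_sigmas * robust_sd
print(np.flatnonzero(bad_samples))             # indices of samples to discard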

Obtaining the Standard Error of Weighted Data in SPSS

I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each of the people who was surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS seems to think that the young man's database entries each represent 28000 measurements when they actually just represent one, and this makes SPSS think we have much more data than we actually do. As a result SPSS is giving very very low standard error estimates and very very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in Statistics is treated as a frequency weight.
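To make the point concrete (this is only a Python illustration of why frequency-weight treatment shrinks the standard error, not how SPSS or Complex Samples computes it): treating raw weights as frequencies inflates the apparent n to the sum of the weights, while rescaling the weights to average 1 keeps the weighted mean but restores the real sample size in the denominator:

import numpy as np

rng = np.random.default_rng(1)
# hypothetical survey: 200 respondents, each carrying a sampling weight
y = rng.normal(50, 10, size=200)
w = rng.uniform(5000, 30000, size=200)         # e.g. "this person represents 28000 people"

mean_w = np.average(y, weights=w)
var_w = np.average((y - mean_w) ** 2, weights=w)

# frequency-weight treatment: apparent sample size is sum(w), so SE is tiny
se_freq = np.sqrt(var_w / w.sum())

# weights rescaled to average 1: same weighted mean, real sample size of 200
w_norm = w / w.mean()
se_norm = np.sqrt(var_w / w_norm.sum())

# neither is a proper design-based standard error; that is what Complex Samples is for
print(se_freq, se_norm)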
