I have a sample of wage data that I know is not representative of the population. To be specific, it contains many high-paying pastoral wages: pastors make up at most about 0.5% of the workforce, but roughly 25% of my sample. Is it consistent with the bootstrap to first subsample the pastoral wages down to only a few observations and then combine these data points with the rest of the original dataset (with the other pastoral wages removed) before bootstrapping?
Or is there a different way, consistent with bootstrap assumptions, to correct the sample so that it is more likely to be representative of the population?
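For concreteness, here is a sketch of the kind of correction I have in mind (made-up wage arrays and a hypothetical 0.5% target share), where the pastoral wages are down-sampled inside each bootstrap draw rather than once up front:

import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins for my data: pastors are ~25% of the sample but should be ~0.5%.
pastor_wages = rng.lognormal(11.0, 0.3, 500)
other_wages = rng.lognormal(10.0, 0.5, 1500)

target_share = 0.005                              # believed population share of pastors
n_total = len(pastor_wages) + len(other_wages)
n_pastor = max(1, round(target_share * n_total))

n_boot = 2000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Resample each group separately so pastors enter at the believed population share.
    draw_pastor = rng.choice(pastor_wages, size=n_pastor, replace=True)
    draw_other = rng.choice(other_wages, size=n_total - n_pastor, replace=True)
    boot_means[b] = np.concatenate([draw_pastor, draw_other]).mean()

print(np.percentile(boot_means, [2.5, 97.5]))     # bootstrap interval for the mean wage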
I ran an impulse response analysis on a value-weighted stock index and a few other variables in Python and got the following results:
I am not sure how to interpret these results.
Can anyone please help me out?
You might want to check the book "New Introduction to Multiple Time Series Analysis" by Helmut Lütkepohl (2005) for the theory behind the method, though it is fairly dense.
In the meantime, here is a simple way you can interpret your plots. Say your variables are VW, SP500, oil, uts, prod, cpi, n3 and usd. They are all parts of the same system; what the impulse response analysis does is assess how much a shock to one variable impacts another, independently of the other variables. So each panel shows a pairwise shock from one variable to another. Your first plot is VW -> VW, which is pretty much the response of VW to its own shock (similar in spirit to an autocorrelation plot). Now look at the other plots: apparently SP500 exerts the largest impact on VW (you can see a peak in the blue line reaching 0.25). The y-axis is given in standard deviations and the x-axis in lag periods, so in your example a shock to SP500 causes a 0.25 standard-deviation change in VW at whatever lag is on your x-axis (I can't read it from your figure). Similarly, you can see n3 negatively impacting VW at a given period.
There is an interesting link, which you probably already know, that shows an example of applying Python's statsmodels VAR for impulse response analysis.
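If you are working with statsmodels, a minimal sketch looks something like this (made-up data; the column names are just taken from your question):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.api import VAR

# Made-up stationary series standing in for your returns; replace with your own DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((500, 3)), columns=["VW", "SP500", "oil"])

model = VAR(df)
results = model.fit(2)                        # or model.fit(maxlags=8, ic="aic") to choose the lag by AIC

irf = results.irf(10)                         # responses traced out over 10 periods
irf.plot(orth=True)                           # one panel per (impulse, response) pair
irf.plot(impulse="SP500", response="VW", orth=True)   # a single panel, e.g. SP500 -> VW
plt.show()

The orthogonalized responses are what most textbooks (including Lütkepohl) plot; keep in mind that they depend on the ordering of the variables.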
I used this method to assess how one variable impacts another in a plant-water-atmosphere system; there are some explanations there, along with the interpretation of similar plots. Take a look:
Use of remote sensing indicators to assess effects of drought and human-induced land degradation on ecosystem health in Northeastern Brazil
Good luck!
I have a time series of weekly usage data and I'm going to attempt to use some statistics to segment the population. Skewness and kurtosis may allow me to describe the time series and group the people in different ways. But I also notice that some appear to have saw-tooth patterns or bimodal patterns, and I don't think those two statistics will describe them well. Distance from the mean would tell me who has continual steady usage versus unpredictable usage.
What descriptive statistics are commonly used for time-series data?
Thanks,
The periodogram and the autocorrelation function are two common sources of information used to analyse and model time series. You can use this information to compare the series.
In the periodogram you can detect the frequencies at which the estimated spectral density is the highest. This will tell you which series are dominated by cycles of the same frequency.
The autocorrelation function (the time domain counterpart of the periodogram) and the partial autocorrelation function can similarly be used to compare and group the series. Those series with significant autocorrelations at the same lag orders could be grouped together.
You may need to transform the series in order to discern some of this information, for example taking differences to render the data stationary. Alternatively you can select an ARIMA model for each series and compare the characteristics of each model (those characteristics will be pretty much the same as those observed in the autocorrelation functions).
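A rough sketch of how you might pull these summaries out of each series in Python (made-up weekly data; scipy's periodogram and statsmodels' acf/pacf are just one convenient route):

import numpy as np
from scipy.signal import periodogram
from statsmodels.tsa.stattools import acf, pacf

# Made-up weekly usage series; replace with one of your actual series.
rng = np.random.default_rng(1)
t = np.arange(3 * 52)
x = 5 + 2 * np.sin(2 * np.pi * t / 52) + rng.normal(0, 1, t.size)

freqs, power = periodogram(x)                  # estimated spectral density
dominant = freqs[np.argmax(power[1:]) + 1]     # skip the zero frequency
print("dominant cycle length (weeks):", 1 / dominant)

print("ACF:", acf(x, nlags=10, fft=True))      # autocorrelations up to lag 10
print("PACF:", pacf(x, nlags=10))              # partial autocorrelations

Series whose dominant frequencies or significant lags agree could then be grouped together, as described above.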
I am running an experiment (an image processing experiment) in which I have a set of paper samples, and each sample has a set of lines. For each line in a paper sample, its strength is calculated, denoted by, say, 's'. For a given paper sample I have to find the variation amongst the strength values 's'. If the variation is above a certain limit, we have to discard that paper.
I started with the standard deviation of the values, but the problem I am facing is that the order of magnitude of s can differ from sample to sample (because of various properties of the line, such as its length, sharpness, darkness, etc.), and so the calculated standard deviations also differ a lot in magnitude. So I can't really use this method across different samples.
Is there any way I can find a suitable limit that is applicable to all samples?
I am thinking that since I don't have any history of how the strength values should behave (for a sample where the strength values are large in magnitude, more variation could be tolerated, whereas in a sample where the magnitude is smaller, there should be less variation), I first need to find a way of baselining the variation across samples. I don't know what approaches I could try to get started.
Please note that I have to measure the variation between lines within a sample, whereas the limit should be applicable to any good sample.
Please help me out.
You seem to have a set of samples. Then, for each sample you want to do two things: 1) compute a descriptive metric and 2) perform outlier detection. Both of these are vast subjects that require some knowledge of the phenomenology and statistics of the underlying problem. However, below are some ideas to get you going.
Compute a metric
Median Absolute Deviation. If your sample strength s has values that can jump by an order of magnitude across a sample then it is understandable that the standard deviation was not a good metric. The standard deviation is notoriously sensitive to outliers. So, try a more robust estimate of dispersion in your data. For example, the MAD estimate uses the median in the underlying computations which is more robust to a large spread in the numbers.
Robust measures of scale. Read up on other robust measures like the interquartile range.
Perform outlier detection
Thresholding. This is similar to what you are already doing; however, you have to choose a suitable threshold for the metric computed above. You might consider using robust quantities for the thresholding itself: compute a robust estimate of the centre (e.g., the median) and a robust estimate of scale (e.g., 1.4826 * MAD), then flag as outliers any metric values more than some number of robust standard deviations from the robust centre (there is a short sketch after this list).
Histogram. Another simple method is to histogram your computed metrics from step 1. This is non-parametric, so it doesn't require you to model your data. You can histogram your metric values and then use the top 1% (or some other value) as your threshold limit.
Triangle method. A neat and simple heuristic for thresholding is the triangle method, which performs a binary split of a skewed distribution.
Anomaly detection. Read up on other outlier detection methods.
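A small sketch of the MAD-based thresholding mentioned above (made-up strength values; the helper name is just for illustration):

import numpy as np

def robust_flags(s, n_sigmas=3.0):
    # Hypothetical helper: flag values far from the median in robust-sigma units.
    s = np.asarray(s, dtype=float)
    med = np.median(s)
    robust_sigma = 1.4826 * np.median(np.abs(s - med))   # consistent with the SD under normality
    return np.abs(s - med) > n_sigmas * robust_sigma

# Made-up line strengths for one paper sample
strengths = np.array([4.1, 3.9, 4.3, 4.0, 9.8, 4.2])
print(robust_flags(strengths))                           # only the 9.8 is flagged

# A scale-free spread (MAD / median) may help a single limit apply across samples
# whose strengths differ by orders of magnitude.
med = np.median(strengths)
print(np.median(np.abs(strengths - med)) / med)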
I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data are weighted, because each person who was surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS seems to think the young man's database entries each represent 28000 measurements when they actually just represent one, and this makes SPSS think we have much more data than we actually do. As a result, SPSS is giving extremely low standard error estimates and extremely narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in SPSS Statistics is treated as a frequency weight.
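As a rough sanity check outside SPSS (not a substitute for the Complex Samples procedures), you can normalise the weights and use Kish's effective sample size to approximate the standard error; a sketch with made-up numbers:

import numpy as np

# Made-up survey values and population weights; replace with your own columns.
y = np.array([12.0, 15.0, 9.0, 20.0, 11.0])
w = np.array([28000.0, 15000.0, 30000.0, 5000.0, 22000.0])

w_norm = w / w.mean()                          # weights now average 1 and sum to n
mean_w = np.average(y, weights=w_norm)

# Kish's effective sample size reflects how much the weights inflate the variance.
n_eff = w_norm.sum() ** 2 / (w_norm ** 2).sum()

# Rough SE of the weighted mean (ignores stratification and clustering).
var_w = np.average((y - mean_w) ** 2, weights=w_norm)
se_approx = np.sqrt(var_w / n_eff)
print(mean_w, n_eff, se_approx)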
I have a set of 200 data rows (i.e., a fairly small dataset). I want to carry out some statistical analysis, but before that I want to exclude outliers.
What are potential algorithms for this purpose? Accuracy is a concern.
I am very new to statistics, so I need help with very basic algorithms.
Overall, the thing that makes a question like this hard is that there is no rigorous definition of an outlier. I would actually recommend against using a certain number of standard deviations as the cutoff for the following reasons:
A few outliers can have a huge impact on your estimate of standard deviation, as standard deviation is not a robust statistic.
The interpretation of standard deviation depends hugely on the distribution of your data. If your data is normally distributed then 3 standard deviations is a lot, but if it's, for example, log-normally distributed, then 3 standard deviations is not a lot.
There are a few good ways to proceed:
Keep all the data, and just use robust statistics (median instead of mean, Wilcoxon test instead of T-test, etc.). Probably good if your dataset is large.
Trim or Winsorize your data. Trimming means removing the top and bottom x%. Winsorizing means capping the bottom x% at the xth percentile and the top x% at the (100-x)th percentile (there is a short sketch after this list).
If you have a small dataset, you could just plot your data and examine it manually for implausible values.
If your data looks reasonably close to normally distributed (no heavy tails and roughly symmetric), then use the median absolute deviation instead of the standard deviation as your test statistic and filter to 3 or 4 median absolute deviations away from the median.
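A small sketch of trimming and winsorizing (made-up data; scipy's trim_mean and winsorize are just one convenient way to do this):

import numpy as np
from scipy.stats import trim_mean
from scipy.stats.mstats import winsorize

# Made-up sample of 200 values with a couple of implausible points.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(50, 5, 198), [150.0, -40.0]])

print(np.mean(x))                                    # pulled around by the outliers
print(trim_mean(x, 0.05))                            # drop 5% from each tail, then average
print(np.mean(winsorize(x, limits=[0.05, 0.05])))    # cap 5% per tail instead of dropping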
Start by plotting the leverage of the outliers and then go for some good ol' interocular trauma (aka look at the scatterplot).
Lots of statistical packages have outlier/residual diagnostics, but I prefer Cook's D. You can calculate it by hand if you'd like using this formula from mtsu.edu (original link is dead, this is sourced from archive.org).
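In case the archived page also disappears, the usual form of the statistic for observation $i$ (with $p$ estimated parameters, mean squared error $\hat{\sigma}^2$, residual $e_i$, and leverage $h_{ii}$) is

$$D_i = \frac{e_i^2}{p\,\hat{\sigma}^2}\cdot\frac{h_{ii}}{(1-h_{ii})^2},$$

and a common rule of thumb is to look more closely at points with $D_i$ greater than about $4/n$ (or, more conservatively, greater than 1).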
You may have heard the expression 'six sigma'.
In casual use this often refers to plus and minus 3 sigma (i.e., standard deviations) around the mean.
Anything outside that range could be treated as an outlier.
On reflection, I think a plain ±3-sigma band may be too lenient: under normality it flags only about 0.27% of observations. Note also that the formal Six Sigma quality standard is far stricter; this article describes how it amounts to "3.4 defective parts per million opportunities," which corresponds to a 4.5-sigma one-sided tail once the conventional 1.5-sigma shift is allowed for.
It seems like a pretty stringent requirement for certification purposes. Only you can decide if it suits you.
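If it helps to calibrate, here is a quick check (assuming normally distributed data) of how much a given sigma cutoff would flag:

from scipy.stats import norm

# Two-sided: fraction of a normal population beyond +/- k standard deviations.
for k in (2, 3, 4.5, 6):
    print(f"+/-{k} sigma flags {2 * norm.sf(k):.2e} of observations")

# The famous "3.4 per million" figure corresponds to a one-sided 4.5-sigma tail
# (6 sigma minus the conventional 1.5-sigma process shift).
print(norm.sf(4.5) * 1e6)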
Depending on your data and its meaning, you might want to look into RANSAC (random sample consensus). This is widely used in computer vision, and generally gives excellent results when trying to fit data with lots of outliers to a model.
And it's very simple to conceptualize and explain. On the other hand, it's non-deterministic, which may cause problems depending on the application.
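A minimal sketch with scikit-learn's RANSACRegressor on made-up data containing a few gross outliers:

import numpy as np
from sklearn.linear_model import RANSACRegressor

# Made-up data: a linear trend plus ten corrupted points.
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, (200, 1))
y = 3.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, 200)
y[:10] += 40

ransac = RANSACRegressor(random_state=0)      # defaults to fitting a linear model
ransac.fit(X, y)
print("inlier fraction:", ransac.inlier_mask_.mean())
print("slope:", ransac.estimator_.coef_[0], "intercept:", ransac.estimator_.intercept_)

The inlier mask gives a principled way to exclude the flagged points; fixing random_state pins down the run-to-run variability mentioned above.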
Compute the standard deviation of the set, and exclude everything more than one, two, or three standard deviations from the mean.
Here is how I would go about it in SQL Server
The query below will get the average weight from a fictional ScaleData table holding a single weigh-in for each person, while not permitting those who are overly fat or thin to throw off the more realistic average:
select w.Gender, Avg(w.Weight) as AvgWeight
from ScaleData w
join ( select d.Gender,
              Avg(d.Weight) as AvgWeight,
              2 * STDEVP(d.Weight) as StdDeviation
       from ScaleData d
       group by d.Gender ) d
  on w.Gender = d.Gender
 and w.Weight between d.AvgWeight - d.StdDeviation
                  and d.AvgWeight + d.StdDeviation
group by w.Gender
There may be a better way to go about this, but it works and works well. If you have come across another more efficient solution, I’d love to hear about it.
NOTE: with the 2 * STDEVP cutoff, the above keeps only weights within two standard deviations of each gender's mean, excluding roughly the most extreme 2.3% in each tail (assuming the weights are approximately normal) for the purpose of the average. You can adjust how much is excluded by changing the 2* multiplier, as per: http://en.wikipedia.org/wiki/Standard_deviation
If you just want to analyse the data, say to compute the correlation with another variable, it's OK to exclude outliers. But if you want to model or predict, it is not always best to exclude them straight away.
Try treating them with methods such as capping, or, if you suspect the outliers contain information or a pattern, replace them with missing values and then model/predict them. I have written some examples of how you can go about this here using R.