Excel - How to split data into train and test sets that are equally distributed - excel-formula

I've got a data set (in Excel) that I'm going to import into SAS to undertake some modelling.
I've got a method for randomly splitting my excel dataset (using the =RAND() function), but is there a way (at the splitting stage) to ensure the distribution of the samples is even (other than to keep randomly splitting and testing the distribution until it becomes acceptable)?
Otherwise, if this is best performed in SAS, what is the most efficient approach for testing the sample randomness?
The dataset contains 35 variables, with a mixture of binary, continuous and categorical variables.

In SAS, you can just use proc surveyselect to do this.
proc surveyselect data=sashelp.cars out=cars_out outall samprate=0.7;
run;
data train test;
set cars_out;
if selected then output test;
else output train;
run;
If there is a particular variable[s] you want to make sure the Train and Test sets are balanced on, you can use either strata or control depending on exactly what sort of thing you're talking about. control will simply make an approximate attempt to even things by the control variables (it sorts by the control variable, then pulls every 3rd or whatever, so you get a sort of approximate balance; if you have 2+ control variables it snake-sorts, Asc. then Desc. etc. inside, but that reduces randomness).
If you use strata, it guarantees you the sample rate inside the strata - so if you did:
proc sort data=sashelp.cars out=cars;
by origin;
run;
proc surveyselect data=cars out=cars_out outall samprate=0.7;
strata origin;
run;
(and the final splitting data step is the same) then you'd get 70% of each separate origin pulled (which would end up being 70% of the total, of course).
Which you do depends on what you care about it being balanced by. The more things you do this with, the less balanced it is with everything else, so be cautious; it may be that a simple random sample is the best, especially if you have a good enough N.
If you don't have enough N, then you can use bootstrapping techniques, meaning you take a sample WITH replacement from that 70% and take maybe 100 of those samples, each with a higher N than your original. Then you do your test or whatever on each sample selected, and the variation in those results tells you how you're doing even if your N is not enough to do it in one pass.

This answer has nothing to do with Excel, but with sampling strategy.
First we must construct a criteria that the sample's measure's are "close enough" to the complete dataset.
Say we are interested in the mean and the standard deviation and that the complete population is a set of 10,000 values in column A
we calculate the mean and standard deviation of the complete dataset.
devise a "close enough" criteria for each measure
pick, say, 500 samples
calculate the measures for the sample.
if the measures are "close enough" we are done, otherwise pick another 500.
We need to be careful that the criteria are not too tight; otherwise we may loop forever.

Related

Ideas on filtering out consistent time series data

So I have two subsets of data that represent two situations. The one that look more consistent needs to be filtered out (they are noise) while the one looks random are kept (they are motions). The method I was using was to define a moving window = 10 and whenever the standard deviation of the data within the window was smaller than some threshold, I suppressed them. However, this method could not filter out all "consistent" noise while also hurting the inconsistent one (real motion). I was hoping to use some kinds of statistical models and not machine learning to accomplish this. Any suggestions would be appreciated!
noise
real motion
The Kolmogorov–Smirnov test is used to compare two samples to determine if they come from the same distribution. I realized that real world data would never be uniform. So instead of comparing my noise data against the uniform distribution, I used scipy.stats.ks_2samp function to compare any bursts against one real motion burst. I then muted the motion if the return p-value is significantly small, meaning I can reject the hypothesis that two samples are from the same distribution.

approximate histogram for streaming string values (card catalog algorithm?)

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal values for the counts, varying the bin width as necessary to keep the counts even. In the literature, these are sometimes called equi-height, or equi-depth histograms.
I'm not looking to do the usual word-count bar chart, I'm looking for something more like an old fashioned library card catalog where you have a set of drawers (bins), and one might hold SAM - SOLD,and the next bin SOLE-STE, while all of Y-ZZZ fits in a single bin. I want to calculate where to put the cutoffs for each bin.
Is there (A) a known algorithm for this, similar to approximate histograms for numeric values? or (B) suggestions on how to encode the strings in a way that a standard numeric histogram algorithm would work. The algorithm should not require prior knowledge of string population.
The best way I can think to do it so far is to simply wait until I have some reasonable amount of data, then form logical bins by:
number_of_strings / bin_count = number_of_strings_in_each_bin
Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.
This has two weaknesses for my use-case. First, it requires two iterations over a potentially very large number of strings, one for the count, one to find the endpoints. More importantly, a good histogram implementation can give an estimate of where in a bin a value falls, and this would be really useful.
Thanks.
If we can't make any assumptions about the data, you are going to have to make a pass to determine bin size.
This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
Of course, if you can make some assumptions about the data, here are some that might help:
For example, you might not know the exact size, but you might know that the value will fall in some interval [a, b]. If you want at most n bins, make the bin size == a/n.
Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m elements on your pass and dump it into an array, where m is something reasonable based on context.
Then, to find the bin endpoints, you'd find the element at size/n/m in your array.
The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size, from a population of an unknown size. See Wikipedia for more details. Reservoir sampling provides a random sample regardless of whether the stream is ordered or not.
We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.
For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges to provide an even number of results in each range (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall within two values: If there are 100 equal sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population falls between those values.
This approach has the advantage of working for any Comparable data type.

Obtaining the Standard Error of Weighted Data in SPSS

I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each of the people who was surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS seems to think that the young man's database entries each represent 28000 measurements when they actually just represent one, and this makes SPSS think we have much more data than we actually do. As a result SPSS is giving very very low standard error estimates and very very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in Statistics is treated as a frequency weight.

Reverse engineer a new set of points from an original set by altering moments, skew, and/or Kurtosis?

I don't know if this is even possible but I'd like to be able to take a set of points, run something on them that calculates the moments, skew and kurtosis values, and have another function that would take those elements and reverse engineer a new set of points using modified values for the moments, skew and/or kurtosis. I already have the analytical function in Delphi Pro 6 which is:
procedure MomentSkewKurtosis(const Data: array of Double;var M1, M2, M3, M4, Skew,Kurtosis: Extended);
I'm looking for a partner function that could return a new Data array after I make alterations to any of the output parameters "var" in MomentSkewKurtosis() and pass them back in to the partner function as input parameters. For example, suppose I wanted to increase the Skew of the data and get a new set of points back that would be the original set of points altered just enough to generate the new Skew value.
The problem is not easy, and probably better targetted at stats, but I'll give you a pointer to a paper that I think is very good, and straight to the mark: Towards the Optimal Reconstruction of a Distribution from its Moments
Hope this helps!
Obviously you can't reconstruct an arbitrary density distribution from a finite amount of variables. You can create a distribution which fits the parameters, but it's not necessarily the original distribution.
And as far as I remember Mean, Variance, Skew and Kurtosis are just functions of the first 4 momenta. So you can't choose them independently from the momenta.
On the other hand there exists a function which you can apply on each data member and that produces a new dataset with the desired properties. I suspect that since you fixed the first 4 momenta it's a polynomial of grade 3.

What are the efficient and accurate algorithms to exclude outliers from a set of data?

I have set of 200 data rows(implies a small set of data). I want to carry out some statistical analysis, but before that I want to exclude outliers.
What are the potential algos for the purpose? Accuracy is a matter of concern.
I am very new to Stats, so need help in very basic algos.
Overall, the thing that makes a question like this hard is that there is no rigorous definition of an outlier. I would actually recommend against using a certain number of standard deviations as the cutoff for the following reasons:
A few outliers can have a huge impact on your estimate of standard deviation, as standard deviation is not a robust statistic.
The interpretation of standard deviation depends hugely on the distribution of your data. If your data is normally distributed then 3 standard deviations is a lot, but if it's, for example, log-normally distributed, then 3 standard deviations is not a lot.
There are a few good ways to proceed:
Keep all the data, and just use robust statistics (median instead of mean, Wilcoxon test instead of T-test, etc.). Probably good if your dataset is large.
Trim or Winsorize your data. Trimming means removing the top and bottom x%. Winsorizing means setting the top and bottom x% to the xth and 1-xth percentile value respectively.
If you have a small dataset, you could just plot your data and examine it manually for implausible values.
If your data looks reasonably close to normally distributed (no heavy tails and roughly symmetric), then use the median absolute deviation instead of the standard deviation as your test statistic and filter to 3 or 4 median absolute deviations away from the median.
Start by plotting the leverage of the outliers and then go for some good ol' interocular trauma (aka look at the scatterplot).
Lots of statistical packages have outlier/residual diagnostics, but I prefer Cook's D. You can calculate it by hand if you'd like using this formula from mtsu.edu (original link is dead, this is sourced from archive.org).
You may have heard the expression 'six sigma'.
This refers to plus and minus 3 sigma (ie, standard deviations) around the mean.
Anything outside the 'six sigma' range could be treated as an outlier.
On reflection, I think 'six sigma' is too wide.
This article describes how it amounts to "3.4 defective parts per million opportunities."
It seems like a pretty stringent requirement for certification purposes. Only you can decide if it suits you.
Depending on your data and its meaning, you might want to look into RANSAC (random sample consensus). This is widely used in computer vision, and generally gives excellent results when trying to fit data with lots of outliers to a model.
And it's very simple to conceptualize and explain. On the other hand, it's non deterministic, which may cause problems depending on the application.
Compute the standard deviation on the set, and exclude everything outside of the first, second or third standard deviation.
Here is how I would go about it in SQL Server
The query below will get the average weight from a fictional Scale table holding a single weigh-in for each person while not permitting those who are overly fat or thin to throw off the more realistic average:
select w.Gender, Avg(w.Weight) as AvgWeight
from ScaleData w
join ( select d.Gender, Avg(d.Weight) as AvgWeight,
2*STDDEVP(d.Weight) StdDeviation
from ScaleData d
group by d.Gender
) d
on w.Gender = d.Gender
and w.Weight between d.AvgWeight-d.StdDeviation
and d.AvgWeight+d.StdDeviation
group by w.Gender
There may be a better way to go about this, but it works and works well. If you have come across another more efficient solution, I’d love to hear about it.
NOTE: the above removes the top and bottom 5% of outliers out of the picture for purpose of the Average. You can adjust the number of outliers removed by adjusting the 2* in the 2*STDDEVP as per: http://en.wikipedia.org/wiki/Standard_deviation
If you want to just analyse it, say you want to compute the correlation with another variable, its ok to exclude outliers. But if you want to model / predict, it is not always best to exclude them straightaway.
Try to treat it with methods such as capping or if you suspect the outliers contain information/pattern, then replace it with missing, and model/predict it. I have written some examples of how you can go about this here using R.

Resources