Differentiate between 2 correlations - correlation vs causation - statistics

I am comparing two types of crops, let's call them crop A and B.
I have data from ~1000 farms on growth of the plants (average per farm) and want to correlate growth to crop type.
Unfortunately, the different farms also use different fertilizers (fertilizer 1...10), and some have changed the fertilizer used over time...
So, I want to show (with statistical significance) that the growth of crop type A exceeds the growth of crop type B, but make sure it is not coincidence because of the fertilizer used. Can you point me to a statistical test for this purpose? Or do I need to split the data into subgroups (that each contain only one fertilizer) and draw separate conclusions from each subgroup?
Thanks for any hints!
best wishes
Peter.

The type of fertilizer is a confounding variable, which you need to control in order to reduce its effect on your statistical test.
Assuming every crop type may be grown with every fertilizer type, a good way to control that confounding variable is simple stratification.
The sampled data is divided into two groups (crop A, crop B), which are then stratified by fertilizer type to reduce its impact.
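As a rough illustration of the stratification idea, here is a minimal Python sketch on made-up data. The farm records, the two fertilizer levels and the effect sizes are all assumptions for the example, not values from the question. It compares the raw A-minus-B difference with the difference computed inside each fertilizer stratum and then averaged by stratum size.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical data: (crop, fertilizer, growth) per farm. Fertilizer 2 boosts
# growth and is used mostly with crop B, so it confounds the raw comparison.
farms = []
for _ in range(1000):
    crop = random.choice("AB")
    p_fert2 = 0.2 if crop == "A" else 0.8
    fert = 2 if random.random() < p_fert2 else 1
    growth = 10 + (1.0 if crop == "A" else 0.0) + (3.0 if fert == 2 else 0.0)
    growth += random.gauss(0, 1)
    farms.append((crop, fert, growth))

def crop_mean(crop, fert=None):
    """Mean growth for one crop, optionally restricted to one fertilizer."""
    return mean(g for c, f, g in farms if c == crop and (fert is None or f == fert))

# Raw difference mixes the crop effect with the fertilizer effect.
raw_diff = crop_mean("A") - crop_mean("B")

# Stratified: difference within each fertilizer, then a size-weighted average.
strata = sorted({f for _, f, _ in farms})
diffs = {f: crop_mean("A", f) - crop_mean("B", f) for f in strata}
sizes = {f: sum(1 for _, ff, _ in farms if ff == f) for f in strata}
adj_diff = sum(diffs[f] * sizes[f] for f in strata) / sum(sizes.values())

print(f"raw A-B difference:        {raw_diff:+.2f}")
print(f"stratified A-B difference: {adj_diff:+.2f}")
```

In this synthetic setup crop A is built to grow one unit more than crop B, yet the raw comparison points the other way because of the fertilizer imbalance. The sketch only estimates the difference; a significance test on the stratified data would still be needed.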

Related

modelling multiplicative relationships with categorical data

If I want to create a model that best describes the price of an asset using a multiplicative relationship, that is,
Price = base_rate * size_of_asset * number_of_subassets
(size of asset, number of subassets are both 0,1,2,3... N)
can I do this with a linear combination when the variables are categorical? If they were numerical I could log everything, which would do exactly that... however, the same approach can't be applied with categorical data, can it?
NB: I want to keep it as a multiplicative relationship so it's highly interpretable from a ratio perspective - that is, one can say that increasing the size_of_asset by 30% increases the price by x amount.
Thanks for the advice!
I think log-linear might be your solution as it can help you analyse the multiplicative effects of one or more categorical independent variables with a categorical dependent variable.
Check this out:
http://members.home.nl/jeroenvermunt/esbs2005c.pdf
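To make the log idea concrete for categorical predictors, here is a minimal sketch under stated assumptions: the level multipliers and base rate are invented, the data is noise-free, and every combination of levels is observed once. On the log scale each level contributes an additive effect, and exponentiating that effect gives a price multiplier relative to the baseline level. With real, unbalanced data one would normally estimate the same effects with a regression on log(price) using dummy-coded levels.

```python
import math
from statistics import mean

# Hypothetical multipliers for each categorical level (level 0 is the baseline).
base_rate = 100.0
size_factor = {0: 1.0, 1: 1.3, 2: 1.8}
sub_factor = {0: 1.0, 1: 1.1, 2: 1.5}

# Synthetic, noise-free prices on the full grid of level combinations.
rows = [(s, k, base_rate * size_factor[s] * sub_factor[k])
        for s in size_factor for k in sub_factor]

def level_effect(index, level):
    """Mean log-price at `level` minus mean log-price at the baseline level 0."""
    at_level = mean(math.log(r[2]) for r in rows if r[index] == level)
    at_base = mean(math.log(r[2]) for r in rows if r[index] == 0)
    return at_level - at_base

# Exponentiating an additive log-scale effect gives back a price multiplier.
est_size = {s: math.exp(level_effect(0, s)) for s in size_factor}
est_sub = {k: math.exp(level_effect(1, k)) for k in sub_factor}
print(est_size)
print(est_sub)
```

Each recovered value reads as a ratio, e.g. "size level 1 multiplies the price by about 1.3 relative to level 0", which is the interpretability the question asks for.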

Quasi-Monte-Carlo vs. variable dimensionality?

I've been looking through the Matlab documentation on using quasi-random sampling of N-dimensional unit cubes. This represents a problem with N stochastic parameters. Based on the fact that it is a unit cube, I presume that I need to use the inverse CDF of each parameter to map from the [0,1] domain to the value range of each parameter.
I would like to try this on a problem for which I now use Monte Carlo. Unfortunately, the problem I'm analyzing does not have a fixed number of dimensions. For each instantiation of the problem, I generate a variable number of widgets (say) using a Poisson distribution. Only after that do I randomly generate the parameters for each widget. That whole process yields one instance of the problem to be analyzed, so the number of parameters varies from one instance to the next.
Is this kind of problem still amenable to Quasi-Monte-Carlo?
What I did once was take the highest possible dimension d of the problem, generate a Sobol sequence in d dimensions, and use whatever number of points was necessary for a particular sampling. I would say it helped somewhat...
After talking to a much smarter colleague, I see that we need to consider the various combinations of widget counts for each widget type. For example, 2 of widget type #1, 4 of widget type #2, 1 of widget type #3, etc., constitute one combination. QMC can be applied to that one combination. We are assuming that the number of widgets of type #i is independent of the number of widgets of type #j for i<>j, so the probability of each combination is just the product of p(2 widgets of type #1), p(4 widgets of type #2), p(1 widget of type #3), etc. The individual probabilities are easy to get from their Poisson distributions (or their flat distributions, or whatever distribution is being used). If there are N widget types, this is just a joint PMF in N-space. This probability is then used to weight the QMC result for that particular combination. Note that even when the exact combination is nailed down, QMC is still needed because each widget is associated with 3 stochastic parameters.
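The weighting scheme can be sketched in Python as follows. This is a simplified illustration, not the setup from the question: it assumes a single widget type, one uniform parameter per widget (so the inverse CDF is the identity), a Halton sequence instead of Sobol, and a Poisson tail truncated at 12 widgets. The helper names are made up for the example.

```python
import math

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]

def radical_inverse(i, base):
    """Van der Corput radical inverse of integer i in the given base."""
    result, f = 0.0, 1.0 / base
    while i > 0:
        result += (i % base) * f
        i //= base
        f /= base
    return result

def halton_point(i, dim):
    """i-th point of the Halton sequence in `dim` dimensions."""
    return [radical_inverse(i, PRIMES[d]) for d in range(dim)]

def qmc_mean(integrand, dim, n_points=512):
    """QMC estimate of the mean of integrand(u), u uniform on the unit cube."""
    if dim == 0:
        return integrand([])
    total = sum(integrand(halton_point(i, dim)) for i in range(1, n_points + 1))
    return total / n_points

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Hypothetical problem: each widget has one parameter, uniform on [0, 1], and
# the quantity of interest is the sum of the parameters over all widgets.
def total_of_parameters(u):
    return sum(u)

lam = 2.0                   # assumed mean widget count
max_widgets = len(PRIMES)   # truncate the Poisson tail at 12 widgets

# One fixed-dimension QMC run per widget count, weighted by its probability.
estimate = sum(poisson_pmf(n, lam) * qmc_mean(total_of_parameters, n)
               for n in range(max_widgets + 1))
print(estimate)  # the analytic value for this toy problem is lam * 0.5
```

Each widget count gets its own fixed-dimension QMC run, so the variable dimensionality is handled by the outer probability-weighted sum rather than by the quasi-random sequence itself.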

Obtaining the Standard Error of Weighted Data in SPSS

I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each of the people who were surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS seems to think that the young man's database entries each represent 28000 measurements when they actually just represent one, and this makes SPSS think we have much more data than we actually do. As a result, SPSS is giving very, very low standard error estimates and very, very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in Statistics is treated as a frequency weight.
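A small Python sketch of why the standard errors collapse when population weights are read as frequency counts. The four respondents and their weights are invented, and the formula is only the arithmetic a frequency-weight reading implies, not SPSS's own code and not a design-based variance estimate.

```python
import math

# Hypothetical survey: four respondents with population weights.
values = [3.0, 5.0, 4.0, 8.0]
weights = [28000.0, 12000.0, 30000.0, 10000.0]

def weighted_mean_and_naive_se(x, w):
    """Weighted mean, and the SE obtained when weights are read as case counts."""
    n = sum(w)  # treated as the number of observations
    m = sum(wi * xi for wi, xi in zip(w, x)) / n
    var = sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / (n - 1)
    return m, math.sqrt(var / n)

mean_raw, se_raw = weighted_mean_and_naive_se(values, weights)

# Rescale so the weights sum to the real sample size (mean weight of 1).
scale = len(weights) / sum(weights)
scaled = [wi * scale for wi in weights]
mean_scaled, se_scaled = weighted_mean_and_naive_se(values, scaled)

print(mean_raw, se_raw)
print(mean_scaled, se_scaled)
```

The mean is unchanged by the rescaling, while the standard error grows to a size that reflects four observations rather than 80000. Even after rescaling, this remains a frequency-weight calculation, which is why the Complex Samples procedures are still the appropriate tool for sampling weights.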

Data mining for significant variables (numerical): Where to start?

I have a trading strategy on the foreign exchange market that I am attempting to improve upon.
I have a huge table (100k+ rows) that represents every possible trade in the market, the type of trade (buy or sell), the profit/loss after that trade closed, and 10 or so additional variables that represent various market measurements at the time of trade-opening.
I am trying to find out if any of these 10 variables are significantly related to the profits/losses.
For example, imagine that variable X ranges from 50 to -50.
The average value of X for a buy order is 25, and for a sell order is -25.
If most profitable buy orders have a value of X > 25, and most profitable sell orders have a value of X < -25 then I would consider the relationship of X-to-profit as significant.
I would like a good starting point for this. I have installed RapidMiner 5 in case someone can give me a specific recommendation for that.
A Decision Tree is perhaps the best place to begin.
The tree itself is a visual summary of feature importance ranking (or significant variables, as phrased in the OP). A decision tree:
gives you a visual representation of the entire classification/regression analysis (in the form of a binary tree), which distinguishes it from any other analytical/statistical technique that I am aware of;
requires very little pre-processing of your data: no normalization, no rescaling, no conversion of discrete variables into integers (e.g., Male/Female => 0/1); it can accept both categorical (discrete) and continuous variables, and many implementations can handle incomplete data (values missing from some of the rows in your data matrix); and
ranks the variables by position: the most significant variable is the root node, and it is more significant than the two child nodes, which in turn are more significant than their four combined children. "Significance" here means the percent of variance explained (with respect to some response variable, aka 'target variable', or the thing you are trying to predict). One proviso: from a visual inspection of a decision tree you cannot distinguish variable significance among nodes of the same rank.
If you haven't used them before, here's how Decision Trees work: the algorithm will go through every variable (column) in your data and every value for each variable and split your data into two sub-sets based on each of those values. Which of these splits is actually chosen by the algorithm--i.e., what is the splitting criterion? The particular variable/value combination that "purifies" the data the most (i.e., maximizes the information gain) is chosen to split the data (that variable/value combination is usually indicated as the node's label). This simple heuristic is just performed recursively until the remaining data sub-sets are pure or further splitting doesn't increase the information gain.
What does this tell you about the "importance" of the variables in your data set? Well, importance is indicated by proximity to the root node, i.e., hierarchical level or rank.
One suggestion: decision trees usually handle both categorical and continuous data without problems; however, in my experience, decision tree algorithms always perform better if the response variable (the variable you are trying to predict using all other variables) is discrete/categorical rather than continuous. It looks like yours is probably continuous, in which case I would consider discretizing it (unless doing so just causes the entire analysis to be meaningless). To do this, just bin your response variable values using parameters (bin size, bin number, and bin edges) meaningful w/r/t your problem domain, e.g., if your response variable consists of continuous values from 1 to 100, you might sensibly bin them into 5 bins: 1-20, 21-40, 41-60, and so on.
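A minimal Python sketch of the binning step, assuming whole-number response values from 1 to 100 and five equal-width bins. The edges, the sample values and the helper name are illustrative only; real bin edges should come from the problem domain.

```python
def bin_label(value, edges):
    """Return the label of the first bin whose upper edge is >= value."""
    lower = edges[0]
    for upper in edges[1:]:
        if value <= upper:
            return f"{lower}-{upper}"
        lower = upper + 1
    raise ValueError(f"{value} is above the last bin edge")

edges = [1, 20, 40, 60, 80, 100]   # hypothetical edges for values from 1 to 100
responses = [3, 20, 21, 57, 100]   # hypothetical response values
labels = [bin_label(v, edges) for v in responses]
print(labels)
```

The resulting labels can then be used as the categorical target for the tree in place of the raw continuous values.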
For instance, from your Question, suppose one variable in your data is X and it has 5 values (10, 20, 25, 50, 100); suppose also that splitting your data on this variable with the third value (25) results in two nearly pure subsets--one low-value and one high-value. As long as this purity were higher than for the sub-sets obtained from splitting on the other values, the data would be split on that variable/value pair.
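The split selection described above can be sketched in a few lines of Python. The rows below are invented so that X takes the five values from the example, and information gain is computed with Shannon entropy; actual decision tree implementations may use other purity measures such as Gini impurity.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, threshold):
    """Gain from splitting (x, label) rows into x <= threshold and x > threshold."""
    left = [label for x, label in rows if x <= threshold]
    right = [label for x, label in rows if x > threshold]
    if not left or not right:
        return 0.0
    parent = entropy([label for _, label in rows])
    n = len(rows)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return parent - children

# Hypothetical rows: X takes the five values from the example above.
rows = [(10, "loss"), (10, "loss"), (20, "loss"), (25, "loss"), (25, "loss"),
        (50, "profit"), (50, "profit"), (50, "loss"),
        (100, "profit"), (100, "profit")]

# Try every observed value of X as a threshold and keep the best one.
gains = {t: information_gain(rows, t) for t in sorted({x for x, _ in rows})}
best = max(gains, key=gains.get)
print(gains)
print("best split: X <=", best)
```

A full tree builder would repeat this search over every variable, split on the winner, and recurse into the two subsets.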
RapidMiner does indeed have a decision tree implementation, and it seems there are quite a few tutorials available on the Web (e.g., on YouTube). (Note, I have not used the decision tree module in RapidMiner, nor have I used RapidMiner at all.)
The other set of techniques I would consider is usually grouped under the rubric Dimension Reduction. Feature Extraction and Feature Selection are perhaps the two most common terms after D/R. The most widely used technique is PCA, or principal-component analysis, which is based on an eigenvector decomposition of the covariance matrix (derived from your data matrix).
One direct result of this eigenvector decomposition is the fraction of variability in the data accounted for by each eigenvector. Just from this result, you can determine how many dimensions are required to explain, e.g., 95% of the variability in your data.
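To show what "fraction of variability per eigenvector" means, here is a minimal Python sketch restricted to two variables, where the eigenvalues of the 2x2 covariance matrix have a closed form. The data and the function name are made up; with ten variables one would use a numerical library or one of the R packages instead.

```python
import math
from statistics import mean

def explained_variance_2d(xs, ys):
    """Fractions of total variance along the two principal components."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [mid + half_gap, mid - half_gap]
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

# Hypothetical measurements of two strongly related market variables.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
ratios = explained_variance_2d(xs, ys)
print(ratios)
```

Because the two made-up variables move almost in lockstep, nearly all of the variance falls on the first component, so one dimension would be enough to pass a 95% threshold here.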
If RapidMiner has PCA or another functionally similar dimension reduction technique, it's not obvious where to find it. I do know that RapidMiner has an R Extension, which of course lets you access R inside RapidMiner. R has plenty of PCA libraries (Packages). The ones I mention below are all available on CRAN, which means any of the PCA Packages there satisfy the minimum Package requirements for documentation and vignettes (code examples). I can recommend pcaPP (Robust PCA by Projection Pursuit).
In addition, I can recommend two excellent step-by-step tutorials on PCA. The first is from the NIST Engineering Statistics Handbook. The second is a tutorial for Independent Component Analysis (ICA) rather than PCA, but I mention it here because it's an excellent tutorial and the two techniques are used for similar purposes.

What are the efficient and accurate algorithms to exclude outliers from a set of data?

I have a set of 200 data rows (i.e., a small data set). I want to carry out some statistical analysis, but before that I want to exclude outliers.
What are the potential algos for the purpose? Accuracy is a matter of concern.
I am very new to Stats, so need help in very basic algos.
Overall, the thing that makes a question like this hard is that there is no rigorous definition of an outlier. I would actually recommend against using a certain number of standard deviations as the cutoff for the following reasons:
A few outliers can have a huge impact on your estimate of standard deviation, as standard deviation is not a robust statistic.
The interpretation of standard deviation depends hugely on the distribution of your data. If your data is normally distributed then 3 standard deviations is a lot, but if it's, for example, log-normally distributed, then 3 standard deviations is not a lot.
There are a few good ways to proceed:
Keep all the data, and just use robust statistics (median instead of mean, Wilcoxon test instead of T-test, etc.). Probably good if your dataset is large.
Trim or Winsorize your data. Trimming means removing the top and bottom x%. Winsorizing means setting the bottom and top x% to the xth and (100-x)th percentile values, respectively.
If you have a small dataset, you could just plot your data and examine it manually for implausible values.
If your data looks reasonably close to normally distributed (no heavy tails and roughly symmetric), then use the median absolute deviation instead of the standard deviation as your test statistic and filter to 3 or 4 median absolute deviations away from the median.
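Two of the options above, Winsorizing and the median-absolute-deviation filter, can be sketched in Python as follows. The sample data, the 10% fraction and the function names are assumptions for the example, and the MAD here is used unscaled (no normal-consistency constant).

```python
from statistics import median

def winsorize(data, fraction):
    """Clamp the lowest and highest `fraction` of values to the nearest kept value."""
    ordered = sorted(data)
    k = int(len(ordered) * fraction)   # number of values to clamp per tail
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(x, low), high) for x in data]

def mad_filter(data, cutoff=3.0):
    """Keep values within `cutoff` median absolute deviations of the median."""
    med = median(data)
    mad = median(abs(x - med) for x in data)
    return [x for x in data if abs(x - med) <= cutoff * mad]

# Hypothetical measurements with one implausible value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 10.1, 55.0]
print(winsorize(data, 0.10))
print(mad_filter(data))
```

Winsorizing keeps all ten rows but pulls the extreme values in, while the MAD filter drops the implausible value outright. If more than half of the values are identical the MAD is zero, and the filter would then need a fallback.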
Start by plotting the leverage of the outliers and then go for some good ol' interocular trauma (aka look at the scatterplot).
Lots of statistical packages have outlier/residual diagnostics, but I prefer Cook's D. You can calculate it by hand if you'd like using this formula from mtsu.edu (original link is dead, this is sourced from archive.org).
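For reference, a hand calculation of Cook's D for a simple one-predictor regression can be sketched in Python as below. It uses the standard textbook form D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2, which may be laid out differently from the linked formula, and the data is invented so that the last point sits far off the line.

```python
from statistics import mean

def cooks_distance(xs, ys):
    """Cook's distance of each point for a simple linear regression y = a + b*x."""
    n, p = len(xs), 2   # p = number of fitted coefficients
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each point in a one-predictor regression.
    leverages = [1 / n + (x - mx) ** 2 / sxx for x in xs]
    return [e ** 2 / (p * mse) * h / (1 - h) ** 2
            for e, h in zip(residuals, leverages)]

# Hypothetical data: roughly y = 2x + 1, with the last point far off the line.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 35.0]
distances = cooks_distance(xs, ys)
worst = max(range(len(xs)), key=lambda i: distances[i])
print(worst, distances[worst])
```

Points with a distance well above the rest (a common rule of thumb is above 1) are the ones worth inspecting on the scatterplot.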
You may have heard the expression 'six sigma'.
This is often taken to mean plus and minus 3 sigma (i.e., standard deviations) around the mean, a range that covers about 99.7% of normally distributed data.
Anything outside the 'six sigma' range could be treated as an outlier.
On reflection, I think 'six sigma' is too wide.
This article describes how it amounts to "3.4 defective parts per million opportunities."
It seems like a pretty stringent requirement for certification purposes. Only you can decide if it suits you.
Depending on your data and its meaning, you might want to look into RANSAC (random sample consensus). This is widely used in computer vision, and generally gives excellent results when trying to fit data with lots of outliers to a model.
And it's very simple to conceptualize and explain. On the other hand, it's non deterministic, which may cause problems depending on the application.
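A minimal Python sketch of the RANSAC idea for a straight-line model, assuming made-up data, a fixed inlier tolerance and a fixed random seed (all illustrative choices). It repeatedly fits a line through two randomly chosen points and keeps the line that the largest number of points agree with.

```python
import random

def ransac_line(points, n_iter=200, tol=0.5, seed=0):
    """Fit y = slope*x + intercept by RANSAC; returns (slope, intercept, inliers)."""
    rng = random.Random(seed)   # fixed seed so the sketch is repeatable
    best = (0.0, 0.0, [])
    for _ in range(n_iter):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue            # vertical line, cannot be written as y = f(x)
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= tol]
        if len(inliers) > len(best[2]):
            best = (slope, intercept, inliers)
    return best

# Hypothetical data: points on y = 2x + 1, with three gross outliers mixed in.
points = [(x, 2.0 * x + 1.0) for x in range(20)]
for i, bad_y in ((3, 100.0), (8, -50.0), (15, 80.0)):
    points[i] = (points[i][0], bad_y)

slope, intercept, inliers = ransac_line(points)
print(slope, intercept, len(inliers))
```

Points not in the returned inlier list are the candidates for exclusion. Without the fixed seed, different runs can select different models on noisier data, which is the non-determinism mentioned above.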
Compute the standard deviation on the set, and exclude everything outside of the first, second or third standard deviation.
Here is how I would go about it in SQL Server.
The query below gets the average weight from a fictional ScaleData table holding a single weigh-in for each person, while not permitting unusually heavy or light people to throw off the more realistic average:
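-- Average weight per gender, ignoring weigh-ins that lie more than
-- 2 standard deviations away from that gender's mean.
select w.Gender, Avg(w.Weight) as AvgWeight
from ScaleData w
join ( -- per-gender mean and 2-sigma half-width
       select d.Gender, Avg(d.Weight) as AvgWeight,
              2*STDEVP(d.Weight) as StdDeviation
       from ScaleData d
       group by d.Gender
     ) band
  on w.Gender = band.Gender
 and w.Weight between band.AvgWeight - band.StdDeviation
                  and band.AvgWeight + band.StdDeviation
group by w.Gender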
There may be a better way to go about this, but it works and works well. If you have come across another more efficient solution, I’d love to hear about it.
NOTE: if the weights are roughly normally distributed, the above removes about the most extreme 5% of values (roughly 2.5% in each tail) from the average. You can adjust the number of outliers removed by changing the 2* multiplier on the standard deviation, as per: http://en.wikipedia.org/wiki/Standard_deviation
If you just want to analyse the data, say to compute the correlation with another variable, it's OK to exclude outliers. But if you want to model / predict, it is not always best to exclude them straightaway.
Try treating them with methods such as capping, or, if you suspect the outliers contain information or a pattern, replace them with missing values and model/predict those. I have written some examples of how you can go about this here using R.

Resources