Post-hoc comparisons for GAMs modeling proportion data

I’m using GAMs to analyze germination data across a range of 5 constant temperatures. My response variable is total germination proportion (germinated/viable seeds) at the end of the experiment. My experimental variables are the substrate on which seeds are germinating and temperature (I’m linking 5 different temperatures into a continuous variable) to generate a curve for each of the substrates. My experimental question is whether germination on one substrate expands the range of temperatures on which seeds can germinate over another substrate and if so at what portion of the temperature range does this occur (or in other words how does the effect differ at different temperatures)?
I know that it is possible to conduct pair-wise comparisons to determine if the germination response on different substrates differed significantly from each other (analogous to main factor results in glm), but what I want to do is to compare where across the germination curve (x axis = temperatures [5, 10, 15, 20, 25 C] and y axis = cumulative germination proportion at the end of the trial) germination differed. With my limited understanding, my thinking is that if I am able to splice the data at specific temperature intervals, or use loess smoothers with a specific window width (matched to my temperature profiles), and then get the predicted values for germination within each of those windows or splices, then I could compare them using some sort of post-hoc testing (analogous to Tukey HSD, etc.), but I’m not sure this is possible.
If this is not possible, would it be sufficient for publication and data interpretation to use main effect p-values to first address whether total germination differs among substrates across the continuous temperature range (5-25 C) and then use data visualisation (e.g. plotting the three germination curves next to each other) in order to visually answer the question of where on the curve are differences most pronounced?
I really appreciate any insight you might have into this!

Related

outlier detection using 2D spatial information

I have a list of sensor measurements for air quality with geo-coordinates, and I would like to implement outlier detection. The list of sensors is relatively small (~50).
The air quality can gradually change with the distance, but abrupt local spikes are likely outliers. If one sensor in the group of closely located sensors shows a higher value it could be an outlier. If the same higher value is shown by more distant sensors it might be OK.
Of course, I can ignore coordinates and do simple outlier detection assuming the normal distribution, but I was hoping to do something more sophisticated. What would be a good statistical way to model this and implement outlier detection?
The above statement, ("If one sensor in the group of closely located sensors shows a higher value it could be an outlier. If the same higher value is shown by more distant sensors it might be OK."), would indicate that sensors that are closer to each other tend to have values that are more alike.
Tobler’s first law of geography - “everything is related to everything else, but near things are more related than distant things”
You can quantify an answer to this question. The focus should not be on the location and values of outlier sensors. Use global spatial autocorrelation to measure the degree to which sensors that are near each other tend to be more alike.
As a start, you will first need to define neighbors for each sensor.
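One common measure of global spatial autocorrelation is Moran's I. The sketch below is only an illustration, not part of the original answer: it assumes binary k-nearest-neighbour weights (one possible way to "define neighbors"), and the function name `morans_i`, the value `k=3`, and the grid data are all made up for the example.

```python
import math

def morans_i(coords, values, k=3):
    """Global Moran's I with binary k-nearest-neighbour weights (assumed scheme)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = 0.0      # sum over neighbour pairs of dev_i * dev_j
    w_sum = 0.0    # sum of all weights (each neighbour link has weight 1)
    for i in range(n):
        # indices of the k sensors closest to sensor i, excluding itself
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(coords[i], coords[j]))
        for j in others[:k]:
            num += dev[i] * dev[j]
            w_sum += 1.0
    denom = sum(d * d for d in dev)
    return (n / w_sum) * (num / denom)

# Hypothetical data: a 5 x 5 grid where values rise smoothly from west to east
coords = [(x, y) for x in range(5) for y in range(5)]
values = [10.0 + 2.0 * x for x, y in coords]
moran = morans_i(coords, values)
print(moran)  # positive when nearby sensors tend to have similar values
```

Values near zero would suggest no spatial structure, in which case flagging outliers relative to neighbours has little to build on.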
I'd calculate a cost function, consisting of two costs:
1: cost_neighbors: Calculates the deviation of the sensor value from an expected value. The expected value is a weighted average of the neighbors' values, where each neighbor is weighted by the inverse of its distance (closer neighbors count more).
2: cost_previous_step: Checks how much the value of the sensor changed compared to the last time step. A large change in value leads to a large cost.
Here is some pseudo code describing how to calculate the costs:
w_i = 1 / distance_neighbor_i
expected_value = (w_0*value_neighbor_0 + w_1*value_neighbor_1 + ...) / (w_0 + w_1 + ...)
cost_neighbors = abs(expected_value - value)
cost_previous_step = abs(value_t - value_t_minus_1)
total_cost = a*cost_neighbors + b*cost_previous_step
a and b are parameters that can be tuned to give each of the costs more or less impact. The total cost is then used to determine if a sensor value is an outlier, the larger it is, the likelier it is an outlier.
To figure out the performance and weights, you can plot the costs of some labeled data points, of which you know if they are an outlier or not.
cost_neighbors
| X
| X X
|
|o o
|o o o
|___o_____________ cost_previous_step
X= outlier
o= non-outlier
You can now either set the threshold by hand or create a small dataset with the labels and costs, and apply any sort of classifier function (e.g. SVM).
If you use Python, an easy way to find neighbors and their distances is scipy.spatial.cKDTree.
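The two costs can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not a tested detector: it uses a brute-force neighbour search from the standard library (enough for about 50 sensors; a KD-tree could replace it), normalises the inverse-distance weights so the expected value is a weighted mean, and all sensor readings, the value `k=3`, and the helper names are hypothetical.

```python
import math

def neighbor_cost(idx, coords, values, k=3):
    """|value - inverse-distance-weighted mean of the k nearest neighbours|."""
    nearest = sorted(
        (math.dist(coords[idx], coords[j]), j)
        for j in range(len(coords)) if j != idx
    )[:k]
    weights = [1.0 / max(d, 1e-9) for d, _ in nearest]  # guard against distance 0
    expected = sum(w * values[j] for w, (_, j) in zip(weights, nearest)) / sum(weights)
    return abs(values[idx] - expected)

def total_cost(idx, coords, values, previous_values, a=1.0, b=1.0, k=3):
    """Weighted sum of the neighbour cost and the change since the last step."""
    cost_neighbors = neighbor_cost(idx, coords, values, k)
    cost_previous_step = abs(values[idx] - previous_values[idx])
    return a * cost_neighbors + b * cost_previous_step

# Hypothetical readings: sensor 2 spikes while its close neighbours stay flat
coords = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
previous = [20.0, 21.0, 20.5, 21.5, 30.0]
current = [20.5, 21.0, 45.0, 21.0, 30.5]
costs = [total_cost(i, coords, current, previous) for i in range(len(coords))]
suspect = max(range(len(costs)), key=costs.__getitem__)
print(suspect, costs)
```

Note that the neighbours of a spiking sensor also get a raised cost, because the spike pulls their expected value up; this is one reason to tune `a`, `b`, and the threshold on labeled data.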

How to design a score or signature function based on the time series data

I want to design a score or signature function based on a time series signal. Usually, the signal has ups and downs.
For a given time window, I want to design the score function based on the number of times the signal fluctuates, the duration of the fluctuations, and the magnitude of the fluctuations. I am wondering what kind of math I can use to design the function. I am not sure if statistical features (mean, median, and so on) would be enough to design a unique function such that two time windows would be distinguishable.
Thanks!
Summary statistics will not give you what you want... but they can still be useful.
Things you can try:
Zero crossings on the signal will give you number of fluctuations. You'll have to use some central tendency value to move the signal about the 0 line in order to do this. Alternatively you can use FFT on the original to find the harmonic frequency as part of the score.
Could define the duration of fluctuations as the difference between zero crossings divided by two (since one fluctuation will reach the 0-line twice).
Magnitude can be done by finding the local minima and maxima - check out some packages with peak finding functions. You might want to use the mean or median to rule out local minima and maxima that fall on the wrong side of the line. Alternatively, finding the zero crossings on the derivative signal and then mapping them back to the original will give you all the local minima and maxima as well.
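The three ingredients above can be sketched with the standard library alone. This is only an illustration under assumptions: the window is centered on its mean, extrema are found from sign changes of the first difference, and the reported durations are the gaps between consecutive zero crossings (half-cycles, so a full up-and-down fluctuation spans two of them). The function name and the test signal are hypothetical.

```python
import math

def window_features(signal, dt=1.0):
    """Zero-crossing count, half-cycle durations and extremum magnitudes."""
    mean = sum(signal) / len(signal)
    centered = [s - mean for s in signal]        # move the signal about the 0 line
    # index i marks a sign change between sample i and sample i + 1
    crossings = [i for i in range(len(centered) - 1)
                 if centered[i] * centered[i + 1] < 0]
    half_cycles = [(b - a) * dt for a, b in zip(crossings, crossings[1:])]
    # local extrema: the slope changes sign around sample i
    extrema = [centered[i] for i in range(1, len(centered) - 1)
               if (centered[i] - centered[i - 1]) * (centered[i + 1] - centered[i]) < 0]
    return {
        "n_crossings": len(crossings),
        "mean_half_cycle": sum(half_cycles) / len(half_cycles) if half_cycles else None,
        "mean_abs_extremum": sum(abs(e) for e in extrema) / len(extrema) if extrema else None,
    }

# Hypothetical window: 3 full cycles of a sine, amplitude 2, offset 5
n = 300
signal = [5.0 + 2.0 * math.sin(2 * math.pi * 3 * (i + 0.25) / n) for i in range(n)]
features = window_features(signal)
print(features)
```

On noisy data the raw crossing count will be inflated by jitter around the 0 line, so some smoothing or a minimum-amplitude rule would probably be needed before using these numbers as a score.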

How to identify data points that are significantly smaller than the others in a data set?

I have an array of data points of real value. I wish to identify those data points whose values are significantly smaller than others. Are there any well-known algorithms?
For example, the data set can be {0.01, 0.32, 0.45, 0.68, 0.87, 0.95, 1.0}. I can manually tell that 0.01 is significantly smaller than the others. However, I would like to know whether there is an analysis method for this purpose in statistics. I tried outlier detection on my data set, but it did not find any outliers (for example, it did not detect 0.01 as an outlier).
I have deleted a segment I wrote explaining the use of z-scores for your problem because it was incorrect. I hope the information below is accurate but, just in case, use it as a guide only...
The idea is to build a z-distribution from the scores you are testing, minus the test score, and then use that distribution to get a z-score for the test score. Any score with |z| greater than 1.96 is unlikely to belong to your test population.
I am not sure that this works properly: because you remove the test score's influence from the distribution, extreme scores will have inflated z-scores, since they would otherwise have contributed to a greater variance (the denominator in the z-score equation).
This could be a start till someone with a modicum of expertise sets us straight :)
e.g.
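n = numel(data_set);
test_z = zeros(1, n);
for i = 1:n
  test_score = data_set(i);
  % leave out only the i-th point (by index, so duplicate values are kept)
  sample_pop = data_set([1:i-1, i+1:n]);
  sample_mean = mean(sample_pop);
  sample_stdev = std(sample_pop);
  % z-score of the held-out value, not of the loop index
  test_z(i) = (test_score - sample_mean) / sample_stdev;
end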
This can be done for higher dimensions by using the dim input for mean.
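As a worked check of the leave-one-out idea, here is the same calculation in Python applied to the example data set from the question. It is a sketch only: the function name is made up, and the 1.96 cut-off is taken from the text above rather than justified for a sample this small.

```python
import statistics

def leave_one_out_z(data):
    """z-score of each point against the mean/stdev of all the other points."""
    z = []
    for i, x in enumerate(data):
        rest = data[:i] + data[i + 1:]           # drop only the i-th point
        z.append((x - statistics.mean(rest)) / statistics.stdev(rest))
    return z

data_set = [0.01, 0.32, 0.45, 0.68, 0.87, 0.95, 1.0]
z_scores = leave_one_out_z(data_set)
flagged = [x for x, z in zip(data_set, z_scores) if abs(z) > 1.96]
print(flagged)
```

For this data set only 0.01 is flagged, with a z-score of roughly -2.5; every other point stays well inside the cut-off.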

Data mining for significant variables (numerical): Where to start?

I have a trading strategy on the foreign exchange market that I am attempting to improve upon.
I have a huge table (100k+ rows) that represents every possible trade in the market, the type of trade (buy or sell), the profit/loss after that trade closed, and 10 or so additional variables that represent various market measurements at the time of trade-opening.
I am trying to find out if any of these 10 variables are significantly related to the profits/losses.
For example, imagine that variable X ranges from 50 to -50.
The average value of X for a buy order is 25, and for a sell order is -25.
If most profitable buy orders have a value of X > 25, and most profitable sell orders have a value of X < -25 then I would consider the relationship of X-to-profit as significant.
I would like a good starting point for this. I have installed RapidMiner 5 in case someone can give me a specific recommendation for that.
A Decision Tree is perhaps the best place to begin.
The tree itself is a visual summary of feature importance ranking (or significant variables as phrased in the OP).
A decision tree gives you a visual representation of the entire classification/regression analysis (in the form of a binary tree), which distinguishes it from any other analytical/statistical technique that I am aware of;
decision tree algorithms require very little pre-processing of your data: no normalization, no rescaling, no conversion of discrete variables into integers (e.g., Male/Female => 0/1); they can accept both categorical (discrete) and continuous variables, and many implementations can handle incomplete data (values missing from some of the rows in your data matrix); and
again, the tree itself is a visual summary of feature importance ranking (i.e., significant variables)--the most significant variable is the root node, and is more significant than the two child nodes, which in turn are more significant than their four combined children. "Significance" here means the percent of variance explained (with respect to some response variable, aka 'target variable' or the thing you are trying to predict). One proviso: from a visual inspection of a decision tree you cannot distinguish variable significance among nodes of the same rank.
If you haven't used them before, here's how Decision Trees work: the algorithm will go through every variable (column) in your data and every value for each variable and split your data into two sub-sets based on each of those values. Which of these splits is actually chosen by the algorithm--i.e., what is the splitting criterion? The particular variable/value combination that "purifies" the data the most (i.e., maximizes the information gain) is chosen to split the data (that variable/value combination is usually indicated as the node's label). This simple heuristic is just performed recursively until the remaining data sub-sets are pure or further splitting doesn't increase the information gain.
What does this tell you about the "importance" of the variables in your data set? Well importance is indicated by proximity to the root node--i.e., hierarchical level or rank.
One suggestion: decision trees handle both categorical and continuous data, usually without problem; however, in my experience, decision tree algorithms perform better if the response variable (the variable you are trying to predict using all other variables) is discrete/categorical rather than continuous. It looks like yours is probably continuous, in which case I would consider discretizing it (unless doing so just causes the entire analysis to be meaningless). To do this, just bin your response variable values using parameters (bin size, bin number, and bin edges) meaningful w/r/t your problem domain--e.g., if your r/v is comprised of 'continuous values' from 1 to 100, you might sensibly bin them into 5 bins, 0-20, 21-40, 41-60, and so on.
For instance, from your Question, suppose one variable in your data is X and it has 5 values (10, 20, 25, 50, 100); suppose also that splitting your data on this variable with the third value (25) results in two nearly pure subsets--one low-value and one high-value. As long as this purity were higher than for the sub-sets obtained from splitting on the other values, the data would be split on that variable/value pair.
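The split-selection step can be made concrete with a small sketch. This is an illustration under assumptions, not how any particular library (or RapidMiner) implements it: it uses entropy-based information gain on a single variable with "x <= threshold" splits, and the ten labeled trades are invented so that X takes the five values from the example above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(xs, labels):
    """Return (threshold, information_gain) for the best 'x <= threshold' split."""
    base = entropy(labels)
    best = (None, 0.0)
    for threshold in sorted(set(xs))[:-1]:       # the largest value cannot split
        left = [l for x, l in zip(xs, labels) if x <= threshold]
        right = [l for x, l in zip(xs, labels) if x > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical trades: profit discretized into "loss"/"win"
xs     = [10, 10, 20, 20, 25, 25, 50, 50, 100, 100]
labels = ["loss", "loss", "loss", "win", "loss", "loss", "win", "win", "win", "win"]
threshold, gain = best_split(xs, labels)
print(threshold, round(gain, 3))
```

With these made-up labels the split at 25 wins: the high-value subset is pure and the low-value subset is nearly pure, which is the situation described in the paragraph above. A real tree repeats this search over every variable and then recurses into each subset.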
RapidMiner does indeed have a decision tree implementation, and it seems there are quite a few tutorials available on the Web (e.g., on YouTube). (Note, I have not used the decision tree module in R/M, nor have I used RapidMiner at all.)
The other set of techniques I would consider is usually grouped under the rubric Dimension Reduction. Feature Extraction and Feature Selection are perhaps the two most common terms after D/R. The most widely used is PCA, or principal-component analysis, which is based on an eigenvector decomposition of the covariance matrix (derived from your data matrix).
One direct result from this eigenvector decomposition is the fraction of variability in the data accounted for by each eigenvector. Just from this result, you can determine how many dimensions are required to explain, e.g., 95% of the variability in your data.
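The "fraction of variability per eigenvector" calculation can be sketched as follows. This is a minimal illustration that assumes NumPy is available; the data matrix is randomly generated for the example, and a real analysis would usually standardize the columns first if they are on different scales.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data matrix: 500 rows, 3 columns, two of them strongly correlated
base = rng.normal(size=500)
data = np.column_stack([
    base + 0.1 * rng.normal(size=500),
    2.0 * base + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
])

cov = np.cov(data, rowvar=False)              # 3 x 3 covariance matrix
eigenvalues = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order
explained = eigenvalues / eigenvalues.sum()   # fraction of variance per component
cumulative = np.cumsum(explained)
# smallest number of components whose cumulative share reaches 95%
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(explained, n_components)
```

Here two components are enough, because the first two columns carry essentially the same information.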
If RapidMiner has PCA or another functionally similar dimension reduction technique, it's not obvious where to find it. I do know that RapidMiner has an R Extension, which of course lets you access R inside RapidMiner. R has plenty of PCA libraries (Packages). The one I mention below is available on CRAN, which means it satisfies the minimum Package requirements for documentation and vignettes (code examples). I can recommend pcaPP (Robust PCA by Projection Pursuit).
In addition, I can recommend two excellent step-by-step tutorials on PCA. The first is from the NIST Engineering Statistics Handbook. The second is a tutorial for Independent Component Analysis (ICA) rather than PCA, but I mention it here because it's an excellent tutorial and the two techniques are used for similar purposes.

Create CDF for Anderson Darling test for Octave forge Statistics package function

I am using Octave and I would like to use anderson_darling_test from the Octave Forge Statistics package to test whether two vectors of data are drawn from the same statistical distribution. Furthermore, the reference distribution is unlikely to be "normal". This reference distribution will be the known distribution, as described in the help for the above function: 'If you are selecting from a known distribution, convert your values into CDF values for the distribution and use "uniform".'
My question therefore is: how would I convert my data values into CDF values for the reference distribution?
Some background information for the problem: I have a vector of raw data values from which I extract the cyclic component (this will be the reference distribution); I then wish to compare this cyclic component with the raw data itself to see if the raw data is essentially cyclic in nature. If the null hypothesis that the two are the same can be rejected, I will then know that most of the movement in the raw data is not due to cyclic influences but is due to either trend or just noise.
If your data has a specific distribution, for instance beta(3,3) then
p = betacdf(x, 3, 3)
will be uniform by the definition of a CDF. If you want to transform it to a normal, you can just call the inverse CDF function
x=norminv(p,0,1)
on the uniform p. Once transformed, use your favorite test. I'm not sure I understand your data, but you might consider using a Kolmogorov-Smirnov test instead, which is a nonparametric test of distributional equality.
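The CDF transformation above can be illustrated with a small simulation. This sketch makes two substitutions that should be read as assumptions: it uses an exponential reference distribution instead of beta(3,3), because its CDF has a closed form in the Python standard library, and it computes the Kolmogorov-Smirnov distance to Uniform(0, 1) by hand rather than calling a statistics package. The parameter values are made up.

```python
import math
import random

def ks_uniform_statistic(p):
    """One-sample Kolmogorov-Smirnov distance between p and Uniform(0, 1)."""
    p = sorted(p)
    n = len(p)
    return max(max((i + 1) / n - v, v - i / n) for i, v in enumerate(p))

random.seed(1)
rate = 2.0                                   # known parameter of the reference distribution
x = [random.expovariate(rate) for _ in range(1000)]

# Exponential CDF: F(x) = 1 - exp(-rate * x); with the right rate, p is uniform
p = [1.0 - math.exp(-rate * v) for v in x]
d_right = ks_uniform_statistic(p)

# The same transform with a wrong rate gives values that are far from uniform
p_wrong = [1.0 - math.exp(-0.5 * v) for v in x]
d_wrong = ks_uniform_statistic(p_wrong)

print(d_right, d_wrong)
```

The distance is small only when the reference distribution and its parameters are correct and known in advance, not estimated from the same data.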
Your approach is misguided in multiple ways. Several points:
The Anderson-Darling test implemented in Octave forge is a one-sample test: it requires one vector of data and a reference distribution. The distribution should be known - not come from data. While you quote the help-file correctly about using a CDF and the "uniform" option for a distribution that is not built in, you are ignoring the next sentence of the same help file:
Do not use "uniform" if the distribution parameters are estimated from the data itself, as this sharply biases the A^2 statistic toward smaller values.
So, don't do it.
Even if you found or wrote a function implementing a proper two-sample Anderson-Darling or Kolmogorov-Smirnov test, you would still be left with a couple of problems:
Your samples (the data and the cyclic part estimated from the data) are not independent, and these tests assume independence.
Given your description, I assume there is some sort of time predictor involved. So even if the distributions coincided, that would not mean they coincide at the same time-points, because comparing distributions collapses over time.
The distribution of cyclic trend + error would not be expected to be the same as the distribution of the cyclic trend alone. Suppose the trend is sin(t). Then it will never go above 1. Now add a normally distributed random error term with standard deviation 0.1 (small, so that the trend is dominant). Obviously you could get values well above 1.
We do not have enough information to figure out the proper thing to do, and it is not really a programming question anyway. Look up time series theory - separating cyclic components is a major topic there. But many reasonable analyses will probably be based on the residuals: (observed value - predicted from cyclic component). You will still have to be careful about auto-correlation and other complexities, but at least it will be a move in the right direction.
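The sin(t) argument and the residual-based approach can be tried out in a short simulation. This is only a sketch under assumptions: the cyclic component is taken as known exactly (in practice it is estimated, which complicates things), the noise is independent Gaussian with standard deviation 0.1 as in the example above, and the lag-1 autocorrelation is just one simple first check on the residuals.

```python
import math
import random

random.seed(0)
t = [i * 0.01 for i in range(5000)]
cyclic = [math.sin(v) for v in t]                    # assumed known cyclic component
observed = [c + random.gauss(0.0, 0.1) for c in cyclic]

# The cyclic part is bounded by 1, the observed series is not
max_cyclic = max(cyclic)
max_observed = max(observed)

# Residuals: observed value minus the value predicted from the cyclic component
residuals = [o - c for o, c in zip(observed, cyclic)]
mean_res = sum(residuals) / len(residuals)

# Lag-1 autocorrelation of the residuals, a first check for leftover structure
num = sum((residuals[i] - mean_res) * (residuals[i + 1] - mean_res)
          for i in range(len(residuals) - 1))
den = sum((r - mean_res) ** 2 for r in residuals)
lag1_autocorr = num / den

print(max_cyclic, max_observed, lag1_autocorr)
```

In this simulation the residuals are pure noise by construction, so their autocorrelation is close to zero; with real data, a clearly non-zero value would suggest trend or other structure that the cyclic component does not capture.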
