I'm trying to compare the variance explained by one factor in an ANOVA across three different experimental conditions. For each condition I performed the same ANOVA, and R² = SS_factor / SS_total. Is there a way to compare the different R² values between conditions?
If not, what is the best way to make the comparison between conditions in this case? Thanks!
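For reference, a minimal MATLAB sketch of that per-condition computation (this assumes a one-way ANOVA via anova1 from the Statistics Toolbox; y, group and the other names are illustrative, not the asker's actual variables):
% One condition: R^2 (eta-squared) from a one-way ANOVA table.
[~, tbl] = anova1(y, group, 'off');   % 'off' suppresses the figure
SS_factor = tbl{2,2};                 % between-groups (factor) sum of squares
SS_total  = tbl{4,2};                 % total sum of squares
R2 = SS_factor / SS_total;            % variance explained by the factor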
I have two data sets (each with datapoints + standard deviation) and want to check whether they are statistically different. What kind of test would be appropriate?
Thank you!
The answer depends. If the blue and red samples are randomly obtained and represent the same group of items measured at different times, then the paired two-sample t-test applies. If they belong to different groups, the unpaired two-sample t-test is suitable. This decision assumes that both the blue and red samples are normally distributed, or can be transformed to a normal distribution by means of a logarithmic transformation. Otherwise, you need to use the Mann-Whitney test. The data values to be used are the Output percentages at the same Input value; the data should be continuous, as in your case.
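For example, in MATLAB the three options above correspond to (a rough sketch; x and y are illustrative names for the blue and red samples, and the Statistics Toolbox is assumed):
% x and y are the blue and red measurements (illustrative names).
[h, p] = ttest(x, y);    % paired two-sample t-test: same items measured twice
[h, p] = ttest2(x, y);   % unpaired two-sample t-test: independent groups
[p, h] = ranksum(x, y);  % Mann-Whitney U test if normality cannot be assumed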
The F test from the ANOVA table is the most suitable procedure for data analysis in experiments where more than two means are to be compared. From a statistical point of view, why is the t-test for the difference of two means not suitable for multiple comparisons?
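(For context, the usual concern is familywise error rate inflation: with k independent comparisons each run at level α, the probability of at least one false positive is 1 − (1 − α)^k. Comparing 3 means pairwise requires 3 t-tests, and at α = 0.05 that gives 1 − 0.95³ ≈ 0.14 rather than 0.05.)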
I have been reading a paper and found a table that seemed very strange to me. The researchers calculated an r value and related it to a categorical factor. To my knowledge, r measures the correlation between two numerical variables.
Any ideas?
[screenshot of the table from the paper]
When you want to assess the association between a continuous variable and a dichotomous variable (e.g. gender), you would typically use the point-biserial correlation.
However, the Pearson and point-biserial correlations are mathematically equivalent and will give you the same value. So technically their values are correct; it is just that the naming might be a bit misleading/confusing.
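A quick way to see the equivalence (a MATLAB sketch; corr needs the Statistics Toolbox, and the numbers below are purely hypothetical, for illustration only):
% Point-biserial correlation is just Pearson's r with the dichotomous
% variable coded 0/1.
score  = [12 15 9 20 18 11]';   % continuous variable (hypothetical data)
gender = [0 1 0 1 1 0]';        % dichotomous variable coded 0/1
r = corr(score, gender);        % identical to the point-biserial formula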
At a very high level this is similar to the nearest neighbor search problem.
From wiki: "given a set S of points in a space M and a query point q ∈ M, find the closest point in S to q".
But there are some significant differences. Specifics:
Each point is described by k variables.
The variables are not all numerical; the data types are mixed: string, int, etc.
All possible values for all variables are not known in advance, but they come from reasonably small sets.
In the data set to be searched, there will be multiple points with the same values for all k variables.
Another way to look at this: there will be many duplicate points.
For each point, let's call the number of duplicates its frequency.
Given a query point q, I need to find the nearest neighbor p such that the frequency of p is at least 15.
There seems to be a wide range of algorithms around NNS, statistical classification, and best bin match.
I am getting a little lost in all the variations. Is there already a standard algorithm I can use, or would I need to modify one?
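For concreteness, a brute-force sketch of one way to handle the mixed types and the frequency constraint (MATLAB; this is not a standard published algorithm, and the Gower-style mismatch distance, the table layout, and all names are illustrative assumptions):
% Brute-force sketch: Gower-style distance over mixed-type attributes,
% returning the nearest unique point whose duplicate count is >= minFreq.
% data is a table with the k columns; query is a one-row table with the
% same columns; assumes at least one point meets the frequency threshold.
function best = nearestFrequentPoint(data, query, minFreq)
    [uniquePts, ~, idx] = unique(data);      % collapse duplicate rows
    freq = accumarray(idx, 1);               % frequency of each unique point
    keep = uniquePts(freq >= minFreq, :);    % enforce the frequency threshold
    d = zeros(height(keep), 1);
    for j = 1:width(keep)
        if isnumeric(keep.(j))               % numeric column: range-normalised distance
            rng = max(data.(j)) - min(data.(j));
            if rng == 0, rng = 1; end
            d = d + abs(keep.(j) - query.(j)) / rng;
        else                                 % string/categorical column: 0/1 mismatch
            d = d + double(~strcmp(string(keep.(j)), string(query.(j))));
        end
    end
    [~, k] = min(d);                         % nearest of the frequent points
    best = keep(k, :);
end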
I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern: the distribution of the x-values is not uniform. There are many more values close to t = 0, but at t = 5 (for example), the frequency of data points is much lower.
Another concern: what happens if two values fall within one bin? I assume I would need to average these values before calculating the averaged curve.
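Roughly, what I have in mind with the binning is something like this (MATLAB sketch; tAll and xAll pooling every point from every curve, and the choice of 10 bins, are just examples):
% Pool all points from all curves into tAll (times) and xAll (values),
% then average the values that fall into each time bin.
edges   = linspace(min(tAll), max(tAll), 11);  % 10 bins (illustrative choice)
bin     = discretize(tAll, edges);             % bin index for every point
binMean = accumarray(bin, xAll, [], @mean);    % averaged curve, one value per bin
binStd  = accumarray(bin, xAll, [], @std);     % spread within each bin
centers = edges(1:end-1) + diff(edges)/2;      % x-locations for plotting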
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series are not uniformly sampled, interpolating prior to computing the mean is one way to avoid biasing towards times where you have more frequent samples. Note that, by definition, the interpolated points are unlikely to fall exactly at the times of your measured points, so interpolation will likely reduce the range of your values. This has a greater effect on the extreme statistics (e.g. the 5th and 95th percentiles) than on the mean. If you plan on going this route, you'll need the interp1 and mean functions.
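A rough sketch of that interpolation route (T and X as cell arrays holding each curve's times and values, and the grid range, are illustrative assumptions, not taken from your data):
% Interpolate every curve onto a common time grid, then average across curves.
% T{i} and X{i} hold the times and values of curve i (illustrative names).
tCommon = linspace(0, 5, 100);               % common time grid (example range)
Y = nan(numel(T), numel(tCommon));           % one row per curve
for i = 1:numel(T)
    Y(i,:) = interp1(T{i}, X{i}, tCommon);   % NaN outside each curve's own range
end
meanCurve = mean(Y, 1, 'omitnan');           % averaged curve
stdCurve  = std(Y, 0, 1, 'omitnan');         % standard deviation to plot around it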
An alternative is to compute a weighted mean. This way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time, you can compute the weighted mean as:
timeStep = diff(t);                                          % length of each time interval
weightedMean = sum(timeStep .* x(1:end-1)) / sum(timeStep);  % each value weighted by its interval
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".