Comparing count data with lots of zeroes - statistics

I'm not one to search for the most tenuous significant difference I can find, but hear me out.
I have some count data with four groups (3 of these can be combined into one, if necessary): groups A, B, C, and X.
Looking at the means and interval plots, X is clearly greater than the others (in terms of mean value), yet I cannot find any statistical test to back this up. This is, I believe, somewhat due to a high variability within groups and the large number of zero values.
I have tried normalizing, removing zeroes, parametric and non-parametric tests, and more, with no success!
Any advice would be greatly appreciated as to how to approach this.
Many thanks.
The link below has the raw data. Groups A, B, and C can be combined into one group if it is relevant.
https://drive.google.com/open?id=0B6iQ6-J6e2TeU25Rd2hsd0Uxd2c
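One approach that copes with the zero inflation, offered as a sketch rather than a definitive answer: a permutation test on the difference in group means makes no distributional assumptions at all. The Python below uses made-up placeholder counts (group_x, group_abc), not the linked data.

    import numpy as np

    def permutation_test_mean_diff(x, y, n_perm=10_000, seed=0):
        """One-sided permutation test for mean(x) - mean(y) > 0.
        No distributional assumptions, so the many zeros and the skew
        are handled naturally."""
        rng = np.random.default_rng(seed)
        x, y = np.asarray(x, float), np.asarray(y, float)
        observed = x.mean() - y.mean()
        pooled = np.concatenate([x, y])
        hits = 0
        for _ in range(n_perm):
            rng.shuffle(pooled)
            diff = pooled[:len(x)].mean() - pooled[len(x):].mean()
            if diff >= observed:
                hits += 1
        # proportion of random relabellings that beat the observed gap
        return (hits + 1) / (n_perm + 1)

    # hypothetical counts standing in for the linked data
    group_x = [0, 0, 5, 12, 0, 7, 3, 0, 9, 15]
    group_abc = [0, 0, 0, 2, 1, 0, 0, 3, 0, 1, 0, 4]
    print(permutation_test_mean_diff(group_x, group_abc))

If the permutation p-value agrees with what the interval plots suggest, a negative binomial or zero-inflated count model would be a natural next step for quantifying the size of the effect.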

Related

Is there a way to evenly distribute colours amongst all the combinations of an nCr selection? For example, 10C6

I have a list of all of the combinations of 10C6. I would like to assign 6 colours, one to each number in each combination, and have it so that:
The sums across the 6 colours are equal.
Each colour appears an equal number of times in each place in the combination. (Given 10C6 = 210, this will be 210/6 = 35 times in each location for each colour.)
Each colour covers each number from 1-10 an equal number of times. (Given 10C6 = 210, this will be 210/10 = 21 times for each number for each colour.)
I have come quite close to a solution by trial and error in Excel, but I wonder whether a solution is even possible and, if so, how to find it.
This is the closest I could get through a fair bit of trial and error.
Please let me know if this isn't clear and I can explain in more detail. I imagine it's quite a complex maths problem, but I couldn't find a name for it.
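It may help to separate "does an assignment satisfying the three conditions exist?" from "how do I search for one?". A first step is a checker for the conditions; the Python sketch below enumerates the 210 combinations and tallies the three quantities, using a hypothetical colour_of structure (one 6-tuple of colour labels per combination) and a deliberately naive rotating assignment as a test case.

    from itertools import combinations
    from collections import Counter

    combos = list(combinations(range(1, 11), 6))   # the 210 combinations of 10C6
    assert len(combos) == 210

    def check_assignment(colour_of):
        """colour_of[i] is a 6-tuple of colour labels (0-5): the colour given
        to each position of combos[i]. Returns the three tallies that the
        conditions constrain."""
        sums = Counter()          # total of the numbers each colour receives
        by_position = Counter()   # (colour, position) appearance counts
        by_number = Counter()     # (colour, number) coverage counts
        for combo, colours in zip(combos, colour_of):
            for pos, (number, colour) in enumerate(zip(combo, colours)):
                sums[colour] += number
                by_position[(colour, pos)] += 1
                by_number[(colour, number)] += 1
        return sums, by_position, by_number

    # naive test case: rotate the colours by one from each combination to the next
    naive = [tuple((pos + i) % 6 for pos in range(6)) for i in range(210)]
    sums, by_position, by_number = check_assignment(naive)
    print(set(by_position.values()))  # {35} would satisfy the position condition
    print(set(by_number.values()))    # all 21s would satisfy the coverage condition
    print(sums)                       # equal totals would satisfy the sum condition

With a checker like this, the trial-and-error search could be automated (random restarts, or handing the constraints to a solver), but whether a perfectly balanced assignment exists is exactly the open part of the question.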

Fast token overlap between strings

I have two sets of tokenised sentences, A and B, and I want to calculate the overlap between them in terms of common tokens. For example, the overlap between the two individual sentences a1 "today is a good day" and b1 "today I went to a park" is 2 ("today" and "a"). I need a simple string matching method, without fuzzy or advanced methods. So the result is a matrix between all sentences in A and B with an overlap count for each pair.
The problem is that, while trivial, it is a quadratic operation (size of A x size of B pair-wise comparisons). With large data, the computation gets very slow very quickly. What would be a smart way of computing this avoiding pair-wise comparisons or doing them very fast? Are there packages/data structures particularly good for this?
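One common way to avoid explicit Python-level pair-wise loops is to build binary token-incidence matrices and let a sparse matrix product do the counting: entry (i, j) of A·Bᵀ is exactly the number of distinct shared tokens. A sketch, assuming the sentences are already tokenised into lists of strings:

    import numpy as np
    from scipy.sparse import csr_matrix

    def overlap_matrix(sentences_a, sentences_b):
        """Common-token counts for every pair of sentences in A and B."""
        vocab = {}

        def incidence_triplets(sentences):
            rows, cols = [], []
            for i, sent in enumerate(sentences):
                for tok in set(sent):                      # unique tokens only
                    rows.append(i)
                    cols.append(vocab.setdefault(tok, len(vocab)))
            return rows, cols

        ra, ca = incidence_triplets(sentences_a)
        rb, cb = incidence_triplets(sentences_b)
        A = csr_matrix((np.ones(len(ra), dtype=np.int32), (ra, ca)),
                       shape=(len(sentences_a), len(vocab)))
        B = csr_matrix((np.ones(len(rb), dtype=np.int32), (rb, cb)),
                       shape=(len(sentences_b), len(vocab)))
        return (A @ B.T).toarray()        # [i, j] = tokens shared by a_i and b_j

    a = [["today", "is", "a", "good", "day"]]
    b = [["today", "i", "went", "to", "a", "park"]]
    print(overlap_matrix(a, b))           # [[2]]

The multiplication is still quadratic in the worst case, but it runs in optimised sparse routines and skips pairs that share no vocabulary, which in practice is most of them. An inverted index (token → sentence ids) achieves the same effect if you prefer to stay in pure Python.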

How can I obtain hourly readings from 24 hour moving average data?

I have an Excel dataset of 24-hour moving averages of PM10 air pollution concentration levels, and I need to obtain the individual hourly readings from them. The moving average is updated every hour, so at hour t the reading is the average of the 24 readings from t-23 to t, and at hour t+1 it is the average of t-22 to t+1, etc. I do not have any known data points to extrapolate from, just the 24-hour moving averages.
Is there any way I can obtain the individual hourly readings for time t, t+1, etc, from the moving average?
The dataset contains data over 3 years, so with 24 readings a day (at every hour), the dataset has thousands of readings.
I have tried searching for a way to implement a simple Excel VBA routine to do this, but have come up empty. Most of the posts I have seen on Stack Overflow and Stack Exchange, or on other forums, involve calculating moving averages from discrete data, which is the reverse of what I need to do here.
The few I have seen involve using matrices, which I am not very sure how to implement.
(https://stats.stackexchange.com/questions/67907/extract-data-points-from-moving-average)
(https://stats.stackexchange.com/questions/112502/estimating-original-series-from-their-moving-average)
Any suggestions would be greatly appreciated!
Short answer: you can't.
Consider a moving average over 3 points, and suppose we multiply each MA term by 3, so that we really have sums of consecutive values:
Data: a b c d e f g
MA:   a+b+c, b+c+d, c+d+e, d+e+f, e+f+g
With initial values, you can do something. To find the value of d, you would need to know b+c, hence to know a (since a+b+c is known). Then to find e, you know c+d+e and d, so you must find c, and since a is already needed, you will also need b.
More generally, for an MA of length n, if you know the first n-1 values (hence also the nth, since you know the first sum), then you can find all subsequent values. You can also start from the end. But basically, if you don't have enough original data, you are lost: there is a one-to-one relation between the first n-1 values of your data and the possible MA series. If you don't have enough information, there are infinitely many possibilities, and you can't decide which one is right.
Here I consider the simplest MA where the coefficient of each variable is 1/n (hence you compute the sum and divide by n). But this would apply to any MA, with slightly more complexity to account for different coefficients for each term in the sum.
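To make the "with initial values, you can do something" point concrete, here is a small Python sketch (not part of the original answer) that unrolls an n-point moving average back into the original series when the first n-1 readings are known. It indexes each average at the start of its window; the question's trailing-window convention is just a shift of the same idea.

    def reconstruct_from_ma(ma, first_values, n=24):
        """Recover a series from its n-point moving average, provided the
        first n-1 original readings are known. ma[k] is assumed to be the
        average of readings k .. k+n-1."""
        series = list(first_values)              # the n-1 known starting readings
        for k, avg in enumerate(ma):
            # avg * n = series[k] + ... + series[k+n-1]; only the last term is unknown
            series.append(avg * n - sum(series[k:k + n - 1]))
        return series

    # toy check with n = 3, matching the a..g example above
    data = [3, 6, 3, 9, 6, 9, 3]
    ma3 = [sum(data[i:i + 3]) / 3 for i in range(len(data) - 2)]
    print(reconstruct_from_ma(ma3, data[:2], n=3))   # recovers the full series

Without those n-1 seed values the loop has nothing to start from, which is exactly the answer's point: every choice of the first 23 hourly readings yields a different, equally valid hourly series.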

SUM the column in parts and it calculates differently from the entire column SUM

If I sum a large column in separate individual sections, and then sum those results, the total comes out slightly different from summing the entire column at once.
The value is off by about 0.00000000001, but my conditional formatting picks this up and treats the two results as different, despite the fact that they are summing the same values.
The formatting of all cells is set to 'Number'.
I can't figure out why or how this would happen. Does anyone have any idea? Has something like this happened to you before when working with high-precision values?
I found this article on Microsoft's website. It discusses limitations in Excel's arithmetic and possible ways to deal with them.
I can't imagine that your input numbers genuinely have 15 digits of precision, so probably the easiest solution is to round your multiplication/division/etc. results (which I assume are what produce the 15 decimal digits) before summing.
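If it helps to see the mechanism outside Excel, the same IEEE 754 double-precision arithmetic Excel uses can be demonstrated in a few lines of Python; the exact residue will differ, but the non-associativity is the same effect described above.

    import math

    # Floating-point addition is not associative: regrouping the same numbers
    # can change the last bits of the result.
    print((0.1 + 0.2) + 0.3)        # 0.6000000000000001
    print(0.1 + (0.2 + 0.3))        # 0.6

    # The same effect on a "column": one full sum vs. partial sums added together.
    column = [0.1, 0.2, 0.3] * 1000
    whole = sum(column)
    parts = sum(column[:1500]) + sum(column[1500:])
    print(whole - parts)            # typically a tiny residue, not exactly zero

    # Rounding the results (as suggested above) or using a compensated sum
    # makes the two approaches agree.
    print(round(whole, 6) == round(parts, 6))
    print(math.fsum(column))

In Excel terms, wrapping the intermediate results or the comparison in ROUND() to a sensible number of decimals is the usual fix.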

Compute statistical significance with Excel

I have 2 columns and multiple rows of data in Excel. Each column represents an algorithm, and the values in the rows are the results of these algorithms with different parameters. I want to run a statistical significance test on these two algorithms with Excel. Can anyone suggest a function?
As a result, it would be nice to be able to state something like "Algorithm A performs 8% better than Algorithm B with .9 probability (or a 95% confidence interval)".
The wikipedia article explains accurately what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task, but I have failed to find a suitable statistical function.
Any advice on a built-in Excel function, or on function snippets, would be appreciated.
Thanks.
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are simply real numbers between 1 and 100 (they are percentage values). Each row represents a different parameter, and the values in a row are the algorithms' results for that parameter. The results do not depend on each other.
When I take the average of all values for Algorithm A and for Algorithm B, I see that the mean of Algorithm A's results is 10% higher than Algorithm B's. But I don't know whether this difference is statistically significant. In other words, maybe for one parameter Algorithm A scored 100 percent higher than Algorithm B while Algorithm B has higher scores for all the rest, and that single result alone accounts for the 10% difference in the averages.
And I want to do this calculation using just excel.
Thanks for the clarification. In that case you want to do an independent-samples t-test, meaning you want to compare the means of two independent data sets.
Excel has a function TTEST; that's what you need.
For your example you should probably use two tails and type 2.
The formula will output a probability value known as the probability of alpha error. This is the error you would make if you assumed the two datasets are different when in fact they aren't. The lower the alpha error probability, the higher the chance your sets are different.
You should only accept that the two datasets differ if this value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that the t-test needs at least around 30 values per dataset to be reliable enough, and that the type 2 test assumes equal variances for the two datasets. If equal variances cannot be assumed, you should use the type 3 test.
http://depts.alverno.edu/nsmt/stats.htm
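For anyone who wants to sanity-check what the Excel formula returns, the same test is available in Python via scipy; the sketch below uses made-up percentage scores, not real results, and mirrors =TTEST(range1, range2, 2, 2) and its type 3 variant.

    from scipy import stats

    # hypothetical per-parameter scores (percentages), one row per parameter
    algorithm_a = [72.1, 65.3, 88.0, 79.4, 70.2, 81.5, 69.8, 75.0, 83.3, 77.6]
    algorithm_b = [61.0, 66.2, 70.5, 68.9, 64.1, 72.3, 60.7, 69.4, 71.0, 65.8]

    # two-tailed, equal variances: the counterpart of Excel's type 2 test
    t_stat, p_two_tailed = stats.ttest_ind(algorithm_a, algorithm_b, equal_var=True)
    print(t_stat, p_two_tailed)

    # unequal variances (Welch's t-test): the counterpart of the type 3 test
    print(stats.ttest_ind(algorithm_a, algorithm_b, equal_var=False))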
