I have two 6x4 contingency tables of frequency data. They are based on the same sampling criteria for a number of discrete variables, but for two conditions (before and after). I would like to compare them statistically to see how much, if at all, they differ.
A chi-square-type test seems appropriate, but normally it compares the observed counts against theoretical (expected) counts to calculate the statistic. In other words, I need to swap the theoretical counts for the second table. Of course it doesn't have to be a basic chi-square test; any other appropriate test would be OK.
I have access to XLSTAT, Excel and SPSS, and would appreciate some help with this.
I want to compare the output of my simulation model to the observed data in various ways, like using an independent t-test to compare the means. However, when I do the independent t-test in SPSS, I get a different result than the independent t-test in Excel. I don't know why so I don't know which one I should use. Can anybody tell me why the results are different?
Here is the independent t-test in SPSS (with t-value 0,181 and p-value 0,857):
Here is the t-test: Two-Sample Assuming Equal Variances in Excel (also the t-test assuming unequal variances is different than the one in SPSS):
After running the t-test on your original data in SPSS, with the FULL data, I got perfectly matching results from SPSS and Excel (see below).
The problem you had has to do with one case seemingly missing from your "observed" set (as noted by @TomSharpe).
This could be due to a simple error in copying the data to SPSS. In SPSS you had to put the two ranges in the same column. You may have done this manually and made an error, in which case you should learn to use the restructure commands, namely VARSTOCASES, to avoid such mistakes.
On the other hand, if the data is complete but a case still seems to be missing from the analysis, you should check whether the data is weighted by another variable. Weighting can change the apparent number of observations in the analysis, and of course change the results.
Given the following data for 12 users:
username, number of deals for control, revenue from test, revenue from control
Here's an example of what the data looks like
Can you help me figure out how to calculate the significance of the hypothesis that the test is more profitable (preferably using Excel)?
The measure I was thinking of using was the % lift in revenue for each customer.
P.S. I have a background in statistics but am not an expert, so please keep it as simple as possible.
Since each pair of incomes refers to the same individual, you can perform a paired t-test.
Variable 1: Control income
Variable 2: Deals income
Then follow these instructions (copied here for posterity):
1. In Excel, click Data Analysis on the Data tab.
2. From the Data Analysis popup, choose t-Test: Paired Two Sample for Means.
3. Under Input, select the ranges for both Variable 1 and Variable 2.
4. In Hypothesized Mean Difference, you'll typically enter zero. This value is the null hypothesis value, which represents no effect. In this case, a mean difference of zero represents no difference between the two methods, which is no effect.
5. Check the Labels checkbox if you have meaningful variable labels in row 1. This option helps make the output easier to interpret. Ensure that you include the label row in step #3.
6. Excel uses a default Alpha value of 0.05, which is usually a good value. Alpha is the significance level. Change this value only when you have a specific reason for doing so.
7. Click OK.
Alternatively, you can indeed calculate the difference between the two incomes for each user, and then perform a one-sample t-test on those differences (with the null hypothesis that the mean difference is zero). However, such a test is not available out-of-the-box in Excel; the procedure is described here.
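To show why the two approaches are equivalent, here is a minimal Python sketch of the paired t statistic computed as a one-sample t-test on the per-user differences. The revenue figures and the function name `paired_t_statistic` are made up for illustration; they are not the asker's data. It only produces the t statistic and degrees of freedom; turning those into a p-value needs a t-distribution CDF (for example from a statistics package).

```python
import math
import statistics


def paired_t_statistic(control, test):
    """Return (t, degrees of freedom) for a paired t-test.

    Equivalent to a one-sample t-test on the differences test - control,
    with the null hypothesis that the mean difference is zero.
    """
    if len(control) != len(test):
        raise ValueError("paired samples must have the same length")
    diffs = [t - c for c, t in zip(control, test)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical revenue per user (illustrative numbers only).
control_revenue = [100.0, 120.0, 90.0, 110.0, 105.0]
test_revenue = [110.0, 125.0, 95.0, 120.0, 100.0]

t_stat, df = paired_t_statistic(control_revenue, test_revenue)
print(f"t = {t_stat:.4f}, df = {df}")
```

A positive t here means the test revenue is higher on average; for the "test is more profitable" hypothesis you would read the one-tailed result.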
I have the following 3 cases of a numeric metric on a time series (t, t1, t2, etc. denote different hourly comparisons across periods)
If you look at the 3 graphs, t (the period of interest) clearly has a drop-off in image 1, but not so much in image 2 and image 3. Assume this is some sort of numeric metric (raw or derived), and I want to create a system/algorithm which specifically catches case 1 but not case 2 or 3, with t being the point of interest. While visually this makes sense and is very intuitive, I am trying to design a way to do this in Python using the dataframes shown in the picture.
Generally, the problem is: how do I detect when the time series is behaving very differently from any of the prior weeks?
Edit: When I say different, what I really mean is this: my metric trends together across periods t1 to t4, and if a period doesn't follow that trend and tries to separate out of the envelope, that to me is an anomaly. In chart 1 you can see that t tries to split out from the rest of the tn; this is an anomaly for me. In the other cases t is within the bounds of the other time periods. Hope this helps.
With small data, the best approach is to come up with a good transformation into a simpler representation.
In this case I would try the following:
Distance to the median along the time axis, followed by a summary of those distances (e.g. the median or the mean squared error)
Median of the cross-correlation of the signals
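The first suggestion could be sketched roughly as follows. This is only one possible reading of it: the function names, the sample numbers and the `factor=3.0` threshold are all assumptions made for illustration, and the leave-one-out comparison against the prior periods is an addition of mine to give the score something to be compared against.

```python
import statistics


def mse_to_median(current, priors):
    """Mean squared distance of `current` to the pointwise median of `priors`."""
    # Median across the prior periods at each time step.
    baseline = [statistics.median(values) for values in zip(*priors)]
    return statistics.mean((c - b) ** 2 for c, b in zip(current, baseline))


def is_anomalous(current, priors, factor=3.0):
    """Flag `current` if it sits much further from the priors than they sit from each other.

    `factor` is a hypothetical threshold that would need tuning on real data.
    """
    current_score = mse_to_median(current, priors)
    # Leave-one-out: score each prior period against the remaining ones.
    prior_scores = [
        mse_to_median(p, priors[:i] + priors[i + 1:])
        for i, p in enumerate(priors)
    ]
    return current_score > factor * max(prior_scores)


# Hypothetical hourly values for t1..t4 (illustrative numbers only).
priors = [
    [10, 12, 14, 13, 11],
    [11, 12, 15, 13, 10],
    [10, 13, 14, 12, 11],
    [11, 11, 14, 13, 12],
]
dropping_t = [10, 12, 8, 5, 4]     # breaks out of the envelope (like case 1)
normal_t = [10, 12, 14, 13, 11]    # stays inside the envelope (like cases 2 and 3)

print(is_anomalous(dropping_t, priors))
print(is_anomalous(normal_t, priors))
```

With pandas dataframes, the same pointwise median can be taken across the prior-period columns; the plain lists here just keep the sketch self-contained.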
I have a data set that records, for each age group, the total questions people answered. The columns tell how many levels they passed. Here is what it looks like:
To calculate significance between age groups, I did a chi-square test.
I calculated the chi-square value and it is unusually large. Is that expected, or should I use a different test?
If you want to test whether the two variables 'age range' and 'levels' are independent, then the chi-square test could be an option. However, note that for the test to be reliable, the expected frequency in each cell should be greater than or equal to five, and that does not seem to be the case in your example.
An alternative to the chi-square test in these cases is Fisher's exact test, but it is computationally very expensive (only realistic for 2x2 tables and, with the Freeman-Halton extension, maybe for only slightly larger tables). A realistic alternative would be to group categories and so reduce their number.
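As a rough illustration of the expected-frequency check and of grouping categories, here is a small Python sketch. The table values and helper names are hypothetical (the original data is not shown here), and merging the last two columns is just one example of grouping.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table of observed counts."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def count_low_expected_cells(table, minimum=5.0):
    """Number of cells whose expected count falls below `minimum`."""
    return sum(e < minimum for row in expected_counts(table) for e in row)


def merge_last_two_columns(table):
    """Group the last two categories into one by adding their counts."""
    return [row[:-2] + [row[-2] + row[-1]] for row in table]


# Hypothetical counts: rows are age groups, columns are levels passed.
observed = [
    [12, 3, 2],
    [9, 4, 4],
]
print(count_low_expected_cells(observed))   # cells violating the rule of thumb

grouped = merge_last_two_columns(observed)
print(count_low_expected_cells(grouped))
print(chi_square_statistic(grouped))
```

If cells still fall below five after grouping, the categories need to be merged further or a different test considered.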
I have 2 columns and multiple rows of data in Excel. Each column represents an algorithm, and the values in the rows are the results of these algorithms with different parameters. I want to run a statistical significance test on these two algorithms in Excel. Can anyone suggest a function?
As a result, it would be nice to state something like "Algorithm A performs 8% better than Algorithm B with .9 probability (or 95% confidence interval)"
The wikipedia article explains accurately what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task, but I failed to find a suitable function.
Any advice on a built-in Excel function or function snippets is appreciated.
Thanks..
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are merely real numbers between 1 and 100 (they are percentage values). Each row represents a different parameter, and the values in a row are each algorithm's result for that parameter. The results do not depend on each other.
When I take the average of all values for Algorithm A and Algorithm B, I see that the mean of the results Algorithm A produced is 10% higher than Algorithm B's. But I don't know whether this is statistically significant. In other words, maybe for one parameter Algorithm A scored 100 percent higher than Algorithm B, and for the rest Algorithm B has higher scores, but just because of this one result the difference in averages is 10%.
And I want to do this calculation using just Excel.
Thanks for the clarification. In that case you want to do an independent-samples t-test, meaning you want to compare the means of two independent data sets.
Excel has a function TTEST (called T.TEST in Excel 2010 and later); that's what you need.
For your example you should probably use two tails and type 2.
The formula outputs a p-value: the probability of seeing a difference at least this large if the two data sets actually had the same mean. Concluding that the sets differ when they don't is the alpha error, so the lower the p-value, the stronger the evidence that your sets are different.
You should only accept the difference between the two data sets if the value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that, as a rule of thumb, the t-test needs at least around 30 values per data set to be reliable enough, and that the type 2 test assumes equal variances in the two data sets. If the variances are not equal, you should use the type 3 test.
http://depts.alverno.edu/nsmt/stats.htm
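To make the difference between type 2 and type 3 concrete, here is a minimal Python sketch of the two t statistics: the pooled (equal-variance) version, which corresponds to type 2, and Welch's (unequal-variance) version, which corresponds to type 3. The scores and function names are invented for illustration, and the sketch stops at the t statistic and degrees of freedom; Excel's TTEST goes one step further and converts these into the p-value.

```python
import math
import statistics


def pooled_t(a, b):
    """Equal-variance two-sample t statistic and degrees of freedom (like type 2)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / std_err, na + nb - 2


def welch_t(a, b):
    """Unequal-variance (Welch) t statistic and degrees of freedom (like type 3)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se_sq = va / na + vb / nb
    t_stat = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_sq)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se_sq ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t_stat, df


# Hypothetical percentage scores for the two algorithms (illustrative only).
algorithm_a = [72, 85, 78, 90, 81]
algorithm_b = [70, 75, 68, 80, 72]

t_pooled, df_pooled = pooled_t(algorithm_a, algorithm_b)
t_welch, df_welch = welch_t(algorithm_a, algorithm_b)
print(f"pooled: t = {t_pooled:.4f}, df = {df_pooled}")
print(f"Welch:  t = {t_welch:.4f}, df = {df_welch:.2f}")
```

With equal sample sizes the two t statistics coincide and only the degrees of freedom differ; with unequal sizes and unequal variances the statistics themselves diverge, which is when the choice of type matters most.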