Machine learning: convention for (row, column) variables (truth, prediction) in a confusion matrix?

In machine learning, is there an established convention for which are the (row, column) variables (truth, prediction) in a confusion matrix?
In statistics, the long-standing convention is that the columns are truth (e.g., the actual disease status) and the rows are predictions (e.g., the results from a screening test).
It would be both confusing (pun intended) and dismaying if the two fields had opposite conventions.
Thanks and best wishes, David Draper
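For comparison, scikit-learn's `confusion_matrix` uses the layout with rows as the true labels and columns as the predictions, i.e., the transpose of the statistics convention described above. A minimal pure-Python sketch of that row = truth, column = prediction layout (the labels and data are made up):

```python
# Build a confusion matrix with rows = truth, columns = prediction
# (the layout scikit-learn's confusion_matrix uses).
def confusion_matrix(truth, pred, labels=(0, 1)):
    index = {label: i for i, label in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(truth, pred):
        m[index[t]][index[p]] += 1  # row = true label, column = predicted
    return m

truth = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
# m[0][1] counts items with true label 0 predicted as 1.
print(confusion_matrix(truth, pred))  # [[1, 1], [1, 2]]
```

Swapping the two index lookups in the update line yields the statistics convention (columns = truth) instead.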

Related

Statistically compare experimental data and a theory value

I would appreciate it if someone could answer my question. I have two values. The first is experimental data deduced from a single measurement; the uncertainty for this value has been determined. The second value is a theoretical result. My question is: how do I statistically compare these two values? I tried to use a t-test but failed, because the degrees of freedom are df = 1 - 1 = 0 (only one experiment was conducted to measure the first value).
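One common approach in this situation, assuming the measurement error is roughly Gaussian with a known standard deviation, is a z-test rather than a t-test: z = (x - mu) / sigma, with the p-value taken from the standard normal distribution. A stdlib-only sketch (the numbers are hypothetical):

```python
import math


def z_test(x, sigma, theory):
    """Two-sided z-test of a single measurement x with known
    uncertainty sigma against a theoretical value `theory`."""
    z = (x - theory) / sigma
    # Two-sided p-value from the standard normal CDF, via erf.
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p


# Hypothetical: measured 10.4 +/- 0.2 against a theory value of 10.0.
z, p = z_test(x=10.4, sigma=0.2, theory=10.0)
print(f"z = {z:.2f}, p = {p:.3f}")
```

This sidesteps the df = 0 problem because sigma is treated as known from the measurement apparatus rather than estimated from repeated data.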

I am wondering if the statistical analysis I did makes any sense

I am helping with a retrospective study and the data isn't very well organized. I am also new to statistics, so I took a stab at analyzing the data myself. We will be getting the help of a statistician later on, but we're not sure when yet.
We are looking at about 100 patients, and each patient was followed up for a variable amount of time. Throughout each patient's follow-up, a variable number of observations were made at various timepoints. The observations included a set of lab values, anthropometric data, and demographic data. To conduct the analysis, we split the observations into time bins (e.g., 6-month follow-up, 1-year follow-up, etc.). Then, for each timepoint, we categorized each patient into one of 3 groups based on the outcome of interest. Also, for each timepoint, we selected one observation to represent each patient (since there could be many within the same time bin). For the analysis, we did the following:
1. ANOVA within each timepoint to compare the 3 outcome groups, looking at selected independent variables of interest.
2. For the same variables of interest, a repeated-measures ANOVA to see whether they change over time.
3. Tests for correlations between the variables of interest mentioned above and the other independent variables.
4. A univariate binomial logistic regression for each independent variable, to see whether it predicts outcome. There were 3 groups, so we did pairwise regressions (e.g., (outcome 1 + 2) vs. (outcome 3), and (outcome 1) vs. (outcome 2 + 3)).
5. A multivariate binomial logistic regression with forward elimination, using only the significant independent variables retained from step 4.
6. If any independent variables of interest are retained in the multivariate regression, run it again testing for potential interactions with any variables they were correlated with in step 3. We did this by creating a new variable that is the product of the two variables and putting it into the regression.
What I'm trying to show with this analysis is that one key independent variable explains the difference in outcomes among the patients. So far the analysis seems to support this: that variable is one of the few retained at step 6, with a strong significance value. Sorry if this is confusing to read.
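For step 1, the within-timepoint one-way ANOVA boils down to comparing between-group to within-group variance via the F statistic. A stdlib-only sketch, with hypothetical lab values for the 3 outcome groups at one timepoint:

```python
from statistics import mean


def one_way_anova_F(groups):
    """F statistic for a one-way ANOVA over a list of sample groups."""
    k = len(groups)                        # number of groups
    n = sum(len(g) for g in groups)        # total observations
    grand = mean(x for g in groups for x in g)
    # Between-group sum of squares: how far group means sit from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: scatter of observations around their group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical lab values for the 3 outcome groups:
groups = [[5.1, 4.8, 5.3], [6.0, 6.2, 5.9], [7.1, 6.8, 7.0]]
F = one_way_anova_F(groups)
print(f"F = {F:.1f}")  # compare against the F(k-1, n-k) critical value
```

This only produces the statistic; in practice a package (or the statistician, when they arrive) would supply the p-value from the F(k-1, n-k) distribution.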

Can we add non-predictor variable in reference group in cox regression SPSS?

I am running a Cox regression in SPSS and have 'age quartiles' with values 1, 2, 3, 4 as a variable. How do I calculate the risk of disease 'D' for a given predictor 'P' based on these quartiles, keeping the first age quartile '1' as the reference group? Age in this case is not a predictor in the analysis.
I think this is the same question as this one.
If you specify your age-quartiles variable as a strata (stratification) variable, then the relative risk estimates as a function of your predictors are the same across age quartiles. All that will differ are the baseline hazard functions.

Is Chi-Square Test correct here?

I have a data set describing, for each age group, the total number of questions people answered. The columns show how many levels they passed. Here is what it looks like:
To calculate significance between age groups, I did a chi-square test.
I calculated the chi-square value and it is unusually large. Is that expected, or should I use a different test?
If you want to test whether the two variables 'age range' and 'levels' are independent, then the chi-square test could be an option. However, note that for that test to be used in a feasible way, the expected frequency in each category should be at least five, and that does not seem to be the case in your example.
An alternative to the chi-square test in these cases is Fisher's exact test, but it is computationally very expensive (realistic only for 2x2 tables and, with the Freeman-Halton extension, perhaps for slightly larger tables). A realistic alternative would be to group categories and thereby reduce their number.
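The expected-count rule above is easy to check by hand: the expected frequency for cell (i, j) under independence is (row i total x column j total) / grand total. A sketch with a hypothetical age-by-levels table:

```python
def expected_counts(table):
    """Expected cell counts under independence for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]


# Hypothetical counts: rows = age groups, columns = levels passed.
observed = [[12, 3, 1],
            [8, 5, 2],
            [4, 6, 9]]
expected = expected_counts(observed)

# The chi-square approximation is dubious if any expected count is below 5:
ok = all(e >= 5 for row in expected for e in row)
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))
print(ok, round(chi2, 2))
```

With these made-up counts several expected cells fall below 5, so the chi-square statistic would be unreliable and grouping categories (or an exact test) would be advisable.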

Compute statistical significance with Excel

I have 2 columns and multiple rows of data in Excel. Each column represents an algorithm, and the values in the rows are the results of these algorithms with different parameters. I want to run a statistical significance test on these two algorithms in Excel. Can anyone suggest a function?
As a result, it would be nice to state something like "Algorithm A performs 8% better than Algorithm B with 0.9 probability (or at a 95% confidence level)".
The Wikipedia article explains accurately what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task, but I failed to find a suitable statistical function.
Any advice on a built-in Excel function, or function snippets, is appreciated.
Thanks.
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are simply real numbers between 1 and 100 (they are percentage values). As each row represents a different parameter, the values in a row represent the algorithms' results for that parameter. The results do not depend on each other.
When I take the average of all values for Algorithm A and Algorithm B, I see that the mean of Algorithm A's results is 10% higher than Algorithm B's. But I don't know whether this is statistically significant. In other words, maybe for one parameter Algorithm A scored 100 percent higher than Algorithm B while Algorithm B has higher scores for the rest, and the 10% difference in the averages is due to that one result alone.
And I want to do this calculation using just Excel.
Thanks for the clarification. In that case you want an independent-samples t-test, meaning you want to compare the means of two independent data sets.
Excel has a function TTEST (T.TEST in newer versions); that's what you need.
For your example you should probably use two tails and type 2.
The formula outputs a probability value, the probability of an alpha error: the error you would make if you concluded the two data sets are different when they are not. The lower the alpha error probability, the higher the chance that your sets are different.
You should only accept that the two data sets differ if the value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that the t-test needs roughly 30 values per data set to be reliable enough, and that the type 2 test assumes equal variances in the two data sets. If equal variances cannot be assumed, you should use the type 3 test.
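Outside Excel, the same equal-variance ("type 2") t statistic is straightforward to compute directly. A stdlib-only Python sketch, with made-up percentage scores for the two algorithms:

```python
from statistics import mean, variance
import math


def pooled_t(a, b):
    """Equal-variance two-sample t statistic (Excel's 'type 2' TTEST)."""
    na, nb = len(a), len(b)
    # Pool the two sample variances, weighted by their degrees of freedom.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2  # t statistic and degrees of freedom


# Hypothetical percentage scores, one row per parameter setting:
alg_a = [72, 85, 91, 68, 77, 88, 80, 74]
alg_b = [65, 70, 84, 60, 71, 79, 73, 66]
t, df = pooled_t(alg_a, alg_b)
print(f"t = {t:.2f} on {df} degrees of freedom")
# Compare |t| with the two-tailed 5% critical value for df = 14
# (about 2.14), or feed the same columns to Excel's TTEST for the p-value.
```

Feeding the same two columns to `=TTEST(A1:A8, B1:B8, 2, 2)` in Excel should give the p-value corresponding to this statistic.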
http://depts.alverno.edu/nsmt/stats.htm
