Statistical tests for machine learning performance

I have two sets of machine learning outputs, Test A and Test B (paired).
I have run t.test to check for a statistically significant difference between the two.
I am trying to find the best set. I could compare the mean score of each, but are there any other tests that can compare the outcomes to conclude which one is better?
Any recommendations of tests would be appreciated.
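For concreteness, here is a minimal sketch of the two standard paired comparisons in Python with scipy; the arrays and their values are hypothetical placeholders for per-case scores from the two systems.

import numpy as np
from scipy import stats

# Hypothetical paired scores: one entry per test case, same cases in both sets
scores_a = np.array([0.81, 0.74, 0.79, 0.88, 0.77, 0.83])
scores_b = np.array([0.78, 0.75, 0.73, 0.84, 0.70, 0.80])

# Paired t-test (assumes the per-case differences are roughly normal)
t_stat, t_p = stats.ttest_rel(scores_a, scores_b)
print(f"paired t-test: t={t_stat:.3f}, p={t_p:.3f}")

# Wilcoxon signed-rank test: a nonparametric alternative for paired data
w_stat, w_p = stats.wilcoxon(scores_a, scores_b)
print(f"Wilcoxon signed-rank: W={w_stat:.3f}, p={w_p:.3f}")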

Related

How to interpret the Welch test and which post hoc tests should be used after it

I use ANOVA, including the Levene test, univariate significance tests, descriptive statistics and the Tukey test. I have some doubts about what happens when the assumption of homogeneity of variance is not met in the Levene test. In many of the available materials I found this guidance:
"Basic assumptions of ANOVA tests:
Independence of random variables in the populations (groups) under consideration.
Measurability of the analysed variables.
Normality of the distribution of the variables in each population (group).
Homogeneity of the variance in all populations (groups).
If one of the first three assumptions is not met in the analysis of variance, the non-parametric Kruskal-Wallis test should be used. If the assumption of homogeneity of variances is not met, the Welch test should be used to assess the means."
If we had heterogeneous variances and a non-normal distribution, then we would apply the Kruskal-Wallis test, right? On the other hand, what if we have heterogeneous variances but a normal distribution: do we use the Welch test? If we do the Welch test, how should it be interpreted, and what post hoc tests are subsequently recommended to identify statistically significant differences between groups?
I would be very grateful for an answer
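As a sketch of that decision path in Python: scipy covers the Levene and Kruskal-Wallis tests, and the third-party pingouin package provides Welch's ANOVA along with the Games-Howell post hoc test, which is a commonly recommended follow-up when variances are unequal. The data frame and its values here are made up.

import pandas as pd
from scipy import stats
import pingouin as pg

# Hypothetical long-format data: one score column, one group column
df = pd.DataFrame({
    "score": [5.1, 4.8, 5.5, 6.2, 6.0, 6.6, 7.1, 7.4, 6.9],
    "group": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
})
groups = [g["score"].values for _, g in df.groupby("group")]

# Levene's test for homogeneity of variance
print(stats.levene(*groups))

# Normal but heteroscedastic: Welch's ANOVA, then Games-Howell post hoc
print(pg.welch_anova(data=df, dv="score", between="group"))
print(pg.pairwise_gameshowell(data=df, dv="score", between="group"))

# Non-normal: Kruskal-Wallis instead
print(stats.kruskal(*groups))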

Nonparametric statistical significance test for dependent samples with dependent observations

The goal of my research is to establish whether one model outperforms the other (for a single dataset!) and whether the difference is statistically significant.
The procedure is as follows for each of the two models: I use 10-fold CV and repeat the procedure 3 times with different seeds to obtain, let's say, 30 estimates of precision. Hence, I obtain two sets of 30 estimates based on a single dataset.
A test for normality showed that the 30 estimates are not normally distributed, so I need to resort to a nonparametric test. I considered the Wilcoxon signed-rank test, yet it is not suitable when the estimates themselves are dependent, as they are here because the CV training sets overlap. How could I tackle this situation?
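For reference, a sketch of the procedure described above in Python with scikit-learn; the dataset and the two models are placeholders. Note that the plain Wilcoxon test at the end still ignores the dependence between folds, which is exactly the concern raised here.

import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # placeholder dataset

# 10-fold CV repeated 3 times -> 30 precision estimates per model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="precision")
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring="precision")

# The same folds are used for both models, so the 30 estimates are paired
print(stats.wilcoxon(scores_a, scores_b))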

How to get post-hoc tests using the Mixed Models > Generalised Mixed builder in SPSS

I am working in SPSS using the "Generalized Mixed" option under "Mixed Models" to run a GLMM. I have a repeated measures design: I am testing whether repeated sessions (5) had an effect on dogs' approach (Y/N) to three bowl locations. My outcome variable is binary (approach yes/no), Session number (1,2,3,4,5) and Bowl Position (A,B,C) are fixed effects, and DogID is a random effect. I have run a generalised linear mixed model with a binary logistic link but cannot see anywhere on the SPSS model builder display to run post-hoc tests. I would like to run Tukey's. I know that you can run post-hoc tests under the "General Linear Model" > "Univariate" tabs, but as far as I can tell that option cannot account for the binary outcome variable. Does anyone know how to run post-hoc tests on this model builder in SPSS? I understand that many use R for this type of analysis but I am not proficient yet.

Decision Trees - Scikit, Python

I am trying to create a decision tree based on some training data. I have never created a decision tree before, but have completed a few linear regression models. I have 3 questions:
With linear regression I find it fairly easy to plot graphs, fit models, group factor levels, check p-statistics, etc. in an iterative fashion until I end up with a good predictive model. I have no idea how to evaluate a decision tree. Is there a way to get a summary of the model (for example, the .summary() function in statsmodels)? Should this be an iterative process where I decide whether a factor is significant, and if so, how can I tell?
I have been very unsuccessful in visualising the decision tree. With the various approaches I have tried, the code seems to run without any errors, yet nothing appears or plots. The only thing I can do successfully is tree.export_text(model), which just states feature_1, feature_2, and so on, so I don't know what any of the features actually are. Has anybody come across these difficulties with visualising, or found a simple solution?
The confusion matrix that I have generated is as follows:
[[   0  395]
 [   0 3319]]
i.e. the model is predicting the same outcome for every row. Does anyone know why this might be?
Scikit-learn is a library designed to build predictive models, so there are no tests of significance, confidence intervals, etc. You can always build your own statistics, but this is a tedious process. In scikit-learn, you can eliminate features recursively using RFE, RFECV, etc. You can find a list of feature selection algorithms here. For the most part, these algorithms get rid of the least important feature in each loop according to feature_importances_ (where the importance of each feature is defined as its contribution to the reduction in entropy, gini, etc.).
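A minimal sketch of cross-validated recursive feature elimination on a tree; the built-in dataset is just a stand-in for your own data.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

# Drop the least important feature each round, keeping the CV-best subset
selector = RFECV(DecisionTreeClassifier(random_state=0), step=1, cv=5)
selector.fit(X, y)
print(selector.n_features_)  # size of the selected subset
print(selector.support_)     # boolean mask over the original features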
The most straightforward way to visualize a tree is tree.plot_tree(). In particular, you should try passing the names of the features to feature_names. Please show us what you have tried so far if you want a more specific answer.
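For example, a sketch of that call on a built-in dataset so the feature names resolve; outside a notebook the figure only appears after an explicit plt.show(), which may be why nothing seemed to plot.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

data = load_iris()  # stand-in dataset with named features
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Pass real names so nodes show them instead of feature_1, feature_2, ...
plt.figure(figsize=(12, 6))
plot_tree(model, feature_names=data.feature_names, class_names=list(data.target_names), filled=True)
plt.show()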
Try another criterion, set a higher max_depth, etc. Sometimes datasets have unidentifiable records. For example, two observations with the exact same values in all features, but different target labels. Is this the case in your dataset?
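A quick pandas check for the conflicting-records situation just described; the toy frame and its "target" column are stand-ins for your data.

import pandas as pd

# Toy data: the first two rows share identical features but disagree on the label
df = pd.DataFrame({
    "feature_1": [1, 1, 2, 3],
    "feature_2": [0, 0, 5, 5],
    "target":    [0, 1, 1, 1],
})

feature_cols = [c for c in df.columns if c != "target"]

# Feature combinations that map to more than one target label
label_counts = df.groupby(feature_cols)["target"].nunique()
print(label_counts[label_counts > 1])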

Looking for a simple machine learning approach to predict final exam score from training set

I am trying to predict test results based on known previous scores. The test is made up of three subjects, each contributing to the final exam score. For all students I have their previous scores for mini-tests in each of the three subjects, and I know which teacher they had. For half of the students (the training set) I have their final score; for the other half (the test set) I don't. I want to predict their final score.
So the training set looks like this:
student teacher subject1score subject2score subject3score finalscore
while the test set is the same but without the final score:
student teacher subject1score subject2score subject3score
So I want to predict the final score of the test set students. Any ideas for a simple learning algorithm or statistical technique to use?
The simplest and most reasonable method to try is a linear regression, with the teacher and the three scores used as predictors. (This is based on the assumption that the teacher and the three test scores will each have some predictive ability towards the final exam, but that they could contribute differently; for example, the third test might matter the most.)
You don't mention a specific language, but let's say you loaded it into R as two data frames called training.scores and test.scores. Fitting the model would be as simple as using lm:
lm.fit = lm(finalscore ~ teacher + subject1score + subject2score + subject3score, training.scores)
And then the prediction would be done as:
predicted.scores = predict(lm.fit, test.scores)
Googling for "R linear regression", "R linear models", or similar searches will find many resources that can help. You can also learn about slightly more sophisticated methods such as generalized linear models or generalized additive models, which are almost as easy to perform as the above.
ETA: There have been books written about the topic of interpreting linear regression; an example simple guide is here. In general, you'll call summary(lm.fit) to print a bunch of information about the fit. You'll see a table of coefficients in the output that will look something like:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.4511     7.0938  -2.037 0.057516 .
setting       0.2706     0.1079   2.507 0.022629 *
effort        0.9677     0.2250   4.301 0.000484 ***
The Estimate column gives you an idea of how strong the effect of each variable was, while the p-values (Pr(>|t|)) give you an idea of whether each variable actually helped or whether its apparent effect was due to random noise. There's a lot more to it, but I invite you to read the excellent resources available online.
Also, plot(lm.fit) will produce graphs of the residuals (the amount each prediction is off by on the training data), which you can use to determine whether the assumptions of the model are reasonable.
