How to get post-hoc tests using the Mixed Models < Generalised Mixed builder on SPSS - statistics

I am working in SPSS, using the "Generalized Mixed" option under "Mixed Models" to run a GLMM. I have a repeated-measures design: I am testing whether repeated sessions (5) had an effect on dogs' approach (Y/N) to three bowl locations. My outcome variable is binary (approach yes/no), Session number (1, 2, 3, 4, 5) and Bowl Position (A, B, C) are fixed effects, and DogID is a random effect. I have run a generalised linear mixed model with a binary logistic link, but I cannot see anywhere on the SPSS model-builder display to run post-hoc tests; I would like to run Tukey's. I know that you can run post-hoc tests via the "General Linear Model" < "Univariate" tabs, but as far as I can tell that option cannot account for the binary outcome variable. Does anyone know how to run post-hoc tests from this model builder in SPSS? I understand that many use R for this type of analysis, but I am not proficient yet.
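Not an answer to the SPSS builder itself, but for reference, here is a minimal sketch of fitting the same kind of binary GLMM in Python with statsmodels (the file and column names below are assumptions based on the description above); Tukey-style pairwise comparisons are not provided out of the box and would have to be built from the fixed-effect estimates:

# Sketch only: fit a binary logistic GLMM with a random intercept per dog.
# "dog_approach.csv" and its column names are hypothetical placeholders.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

df = pd.read_csv("dog_approach.csv")            # long format, Approach coded 0/1

model = BinomialBayesMixedGLM.from_formula(
    "Approach ~ C(Session) + C(BowlPosition)",  # fixed effects
    vc_formulas={"DogID": "0 + C(DogID)"},      # random intercept for each dog
    data=df,
)
result = model.fit_vb()                         # variational Bayes fit
print(result.summary())                         # fixed-effect estimates to build contrasts from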

Related

Model interaction between words for a sentiment analysis task

I am wondering what is the most appropriate way to model the interaction between two words/variables in a language model for a sentiment analysis task. For example, in the following dataset:
You didn't solve my problem,NEU
I never made that purchase,NEU
You never solve my problems,NEG
The words "solve" and "never", in isolation, doesn't have a negative sentiment. But, when they appear together, they do. Formally speaking: assuming we have a feature «solve» that takes the value 0 when the word «solve» is absent, and 1 when the word is present, and another feature «never» with the same logic: the difference in the probability of Y=NEG between «solve»=0 and «solve»=1 is different when «never»=0 and «never»=1.
But a basic logistic regression (using, for example, sklearn) would not be able to handle this kind of situation.
With sklearn.preprocessing.PolynomialFeatures it is possible to add interaction terms, but it is hardly the most effective option.
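For illustration, here is a minimal sketch (reusing the three example sentences above) of adding explicit interaction terms on top of binary bag-of-words features, so that the «never»/«solve» combination gets its own coefficient:

# Sketch: binary word-presence features plus pairwise interaction terms.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "You didn't solve my problem",
    "I never made that purchase",
    "You never solve my problems",
]
labels = ["NEU", "NEU", "NEG"]

model = make_pipeline(
    CountVectorizer(binary=True),                                             # 0/1 word presence
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),  # word-pair terms
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["You never solve anything"]))

With only three training sentences this is just a toy, but the same pipeline scales to a real corpus; the main drawback is the quadratic blow-up in the number of interaction features.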

Normality Assumption - how to check you have not violated it?

I am relatively new to statistics and am struggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data are normally distributed, but there seem to be lots of papers and articles providing conflicting information.
Some articles say that independent variables need to be normally distributed and that this may require a transformation (log, SQRT, etc.). Others say that in linear modelling there are no assumptions about the distribution of the independent variables.
I am trying to create a multiple regression model to predict highest pain scores on hospital admissions:
DV: numeric pain score (0 = no pain to 5 = intense pain) (discrete dependent variable).
IVs: age (continuous), weight (continuous), sex (nominal), deprivation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check whether my independent variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, SQRT) be appropriate and why? Is it best to do this before or after fitting a model? I assume I am trying to get close to a linear relationship between my DV and IV.
As part of its output, SPSS provides plots of the standardised residuals against predicted values and also normal P-P plots of the standardised residuals. Are these plots all that is needed to check the normality assumption after fitting a model?
Many Thanks in advance!
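For reference, the post-fit residual checks asked about above can also be produced directly; a minimal sketch with statsmodels and matplotlib (the file and column names are placeholders, not the actual data):

# Sketch of post-fit residual diagnostics for a multiple regression.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

df = pd.read_csv("admissions.csv")              # hypothetical file with the variables above

model = smf.ols(
    "pain_score ~ age + weight + C(sex) + C(deprivation_status) + C(race)",
    data=df,
).fit()

# Residuals vs fitted values: look for curvature or a funnel shape.
plt.scatter(model.fittedvalues, model.resid, s=10)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Q-Q plot of residuals: points near the line suggest approximate normality.
sm.qqplot(model.resid, line="45", fit=True)
plt.show()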

Decision Trees - Scikit, Python

I am trying to create a decision tree based on some training data. I have never created a decision tree before, but have completed a few linear regression models. I have 3 questions:
With linear regression I find it fairly easy to plot graphs, fit models, group factor levels, check p-values, etc. in an iterative fashion until I end up with a good predictive model. I have no idea how to evaluate a decision tree. Is there a way to get a summary of the model (for example, the .summary() function in statsmodels)? Should this be an iterative process where I decide whether a factor is significant, and if so, how can I tell?
I have been very unsuccessful in visualising the decision tree. With the various approaches I have tried, the code seems to run without errors, yet nothing appears or is plotted. The only thing I can do successfully is tree.export_text(model), which just lists feature_1, feature_2, and so on, so I don't know what any of the features actually are. Has anybody come across these difficulties with visualising, or found a simple solution?
The confusion matrix that I have generated is as follows:
[[ 0 395]
[ 0 3319]]
i.e. the model is predicting the same outcome for every row. Does anyone know why this might be?
Scikit-learn is a library designed to build predictive models, so there are no tests of significance, confidence intervals, etc. You can always compute your own statistics, but this is a tedious process. In scikit-learn, you can eliminate features recursively using RFE, RFECV, etc. You can find a list of feature selection algorithms here. For the most part, these algorithms get rid of the least important feature in each loop according to feature_importances_ (where the importance of each feature is defined as its contribution to the reduction in entropy, Gini impurity, etc.).
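A minimal sketch of that recursive elimination with a tree, using a built-in dataset as a stand-in for your own data:

# Sketch: recursive feature elimination with cross-validation around a decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Drops the least important feature each round, scoring each subset by cross-validation.
selector = RFECV(DecisionTreeClassifier(random_state=0), step=1, cv=5)
selector.fit(X, y)
print(selector.n_features_)   # number of features kept
print(selector.support_)      # boolean mask of the selected features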
The most straightforward way to visualise a tree is tree.plot_tree(). In particular, you should try passing the names of the features to feature_names. Please show us what you have tried so far if you want a more specific answer.
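A minimal sketch of the plotting call, fitted on a built-in dataset just to show the feature_names and class_names arguments (substitute your own fitted model and column names):

# Sketch: plot a small fitted tree with readable feature and class names.
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

data = load_iris()
model = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

plt.figure(figsize=(16, 8))
tree.plot_tree(model, feature_names=data.feature_names,
               class_names=list(data.target_names), filled=True)
plt.show()   # if nothing appears, the backend may need an explicit show() or savefig()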
Try another criterion, set a higher max_depth, etc. Sometimes datasets also contain unidentifiable records, for example two observations with exactly the same values in all features but different target labels. Is this the case in your dataset?
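One quick way to check for such records, sketched with pandas (the file name and the "target" column name are placeholders):

# Sketch: count feature combinations that appear with more than one label.
import pandas as pd

df = pd.read_csv("training_data.csv")                     # hypothetical file
feature_cols = [c for c in df.columns if c != "target"]

labels_per_combination = df.groupby(feature_cols)["target"].nunique()
print((labels_per_combination > 1).sum(), "feature combinations carry conflicting labels")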

How to see correlation between features in scikit-learn?

I am developing a model that predicts whether an employee keeps their job or leaves the company.
The features are as below:
satisfaction_level
last_evaluation
number_projects
average_monthly_hours
time_spend_company
work_accident
promotion_last_5years
Department
salary
left (boolean)
During feature analysis I tried two approaches, and they give different results for the features, as shown in the image here.
When I plot a heatmap it can be seen that satisfaction_level has a negative correlation with left.
On the other hand, if I just use pandas for the analysis, I get results like this:
In the above image, it can be seen that satisfaction_level is quite important in the analysis, since employees with a higher satisfaction_level tend to keep their jobs.
In the case of time_spend_company, the heatmap shows that it is important, whereas in the second image the difference does not look substantial.
Now I am confused about whether to take this as one of my features or not, and which approach I should use to select features.
Can someone please help me with this?
BTW I am doing ML in scikit-learn and the data is taken from here.
Correlation between features has little to do with feature importance. Your heat map is correctly showing correlation.
In fact, in most cases, when you talk about feature importance you must provide the context of the model you are using: different models may choose different features as important. Moreover, many models assume that the features are independent of one another, so correlations close to zero between features are desirable.
For example, in scikit-learn's linear or logistic regression, you can examine the coef_ attribute to get an estimate of feature importance.
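A minimal sketch contrasting the two, using the column names from the question (the file name is a placeholder):

# Sketch: feature-feature correlation (the heatmap) vs model-based importance (coef_).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("HR_data.csv")                           # hypothetical file name

# Pairwise correlation between numeric columns: this is what the heatmap shows.
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()

# Model-based importance: standardised logistic regression coefficients for `left`.
X = df[["satisfaction_level", "last_evaluation", "number_projects",
        "average_monthly_hours", "time_spend_company"]]
y = df["left"]
clf = LogisticRegression(max_iter=1000).fit(StandardScaler().fit_transform(X), y)
print(dict(zip(X.columns, clf.coef_[0])))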

Need help applying scikit-learn to this unbalanced text categorization task

I have a multi-class text classification/categorization problem. I have a set of ground truth data with K different mutually exclusive classes. This is an unbalanced problem in two respects. First, some classes are a lot more frequent than others. Second, some classes are of more interest to us than others (those generally positively correlate with their relative frequency, although there are some classes of interest that are fairly rare).
My goal is to develop a single classifier or a collection of them to be able to classify the k << K classes of interest with high precision (at least 80%) while maintaining reasonable recall (what's "reasonable" is a bit vague).
The features I use are mostly typical unigram-/bigram-based ones, plus some binary features coming from metadata of the incoming documents being classified (e.g. whether they were submitted via email or through a webform).
Because of the unbalanced data, I am leaning toward developing binary classifiers for each of the important classes, instead of a single one like a multi-class SVM.
What ML algorithms (binary or not) implemented in scikit-learn allow for training tuned to precision (versus, for example, recall or F1), and what options do I need to set for that?
What data analysis tools in scikit-learn can be used for feature selection to narrow down the features that might be the most relevant to the precision-oriented classification of a particular class?
This is not really a "big data" problem: K is about 100, k is about 15, the total number of samples available to me for training and testing is about 100,000.
Thx
Given that k is small, I would just do this manually. For each desired class, train an individual one-vs-the-rest classifier, take a look at the precision-recall curve, and then choose the threshold that gives the desired precision.
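A minimal sketch of that threshold-picking step, on a synthetic imbalanced one-vs-rest problem (substitute your own classifier, features and validation split):

# Sketch: choose the score threshold that reaches a target precision.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
precision, recall, thresholds = precision_recall_curve(y_val, clf.predict_proba(X_val)[:, 1])

# thresholds has one entry fewer than precision/recall; the lowest threshold
# that reaches the target precision gives the best recall at that precision.
target = 0.80
idx = np.where(precision[:-1] >= target)[0]
if idx.size:
    i = idx[0]
    print(f"threshold={thresholds[i]:.3f}  precision={precision[i]:.3f}  recall={recall[i]:.3f}")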

Resources