How can the quality of clusters made in SPSS be evaluated? - statistics

How can the quality of clusters produced in SPSS with the "Two-step clustering" method be evaluated? Which test should be applied to confirm that the quality is good?
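SPSS's TwoStep procedure reports a "silhouette measure of cohesion and separation", which is the average silhouette coefficient and can also be computed outside SPSS. A minimal sketch in Python (using scikit-learn rather than SPSS, on made-up synthetic data):

```python
# Sketch: evaluating cluster quality with the average silhouette
# coefficient -- the same quantity SPSS Two-Step reports as the
# "silhouette measure of cohesion and separation".
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters (illustrative data only).
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(8, 1, (100, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)
print(round(score, 2))  # > 0.5 is the usual "good" cutoff in SPSS's rule of thumb
```

Values above roughly 0.5 are conventionally labelled "good", 0.2-0.5 "fair", and below 0.2 "poor".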

Related

Normality Assumption - how to check you have not violated it?

I am relatively new to statistics and am struggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data are normally distributed, but there seem to be lots of papers and articles providing conflicting information.
Some articles say that independent variables need to be normally distributed and may require a transformation (log, SQRT, etc.). Others say that in linear modelling there are no assumptions about the distribution of the independent variables.
I am trying to create a multiple regression model to predict highest pain scores on hospital admissions:
DV: numeric pain score (0 = no pain to 5 = intense pain; discrete dependent variable).
IVs: age (continuous), weight (continuous), sex (nominal), deprivation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check whether my independent variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, SQRT) be appropriate and why? Is it best to do this before or after fitting a model? I assume I am trying to get close to a linear relationship between my DV and IV.
As part of its output, SPSS provides plots of the standardised residuals against predicted values and also normal P-P plots of the standardised residuals. Are these plots all that is needed to check the normality assumption after fitting a model?
Many Thanks in advance!
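The usual resolution of the conflict described above is that the normality assumption in linear regression applies to the residuals, not to the predictors. A minimal sketch in Python rather than SPSS (variable names and the simulated data are illustrative, not the poster's actual data):

```python
# Sketch: the normality check belongs on the residuals of the fitted
# model, not on the raw independent variables.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
age = rng.uniform(20, 90, 200)                  # predictor: never itself tested for normality
pain = 2.0 + 0.03 * age + rng.normal(0, 0.5, 200)

# Fit a simple linear model and compute the residuals.
slope, intercept, *_ = stats.linregress(age, pain)
residuals = pain - (intercept + slope * age)

# Shapiro-Wilk tests the residuals for normality; a large p-value means
# no evidence against normality (complementing SPSS's P-P plot).
stat, p = stats.shapiro(residuals)
print(round(p, 3))
```

Transforming a skewed predictor such as age is done to improve the linearity of the DV-IV relationship (or to tame influential points), not to satisfy a normality requirement on the predictor itself.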

when to normalize data with zscore (before or after split)

I was taking a Udemy course, which made a strong case for normalizing only the train data (after the split from the test data), since the model will typically be used on fresh data with features on the scale of the original set. And if you scale the test data, then you are not scoring the model properly.
On the other hand, what I found was that my two-class logistic regression model (created with Azure Machine Learning Studio) was getting terrible results after Z-Score scaling only the train data.
a. Is this a problem only with Azure's tools?
b. What is a good rule of thumb for when feature data needs to be scaled (one, two, or three orders of magnitude in difference)?
Not scoring the model properly due to normalized test set doesn't seem to make sense:
you would presumably also normalize data that you use for predictions in the future.
I found this similar question in datascience stackexchange and the top answer suggests not only that test data has to be normalized, but you need to apply the exact same scaling as you have done to the training data, because the scale of your data is also taken into account by your model: differently scaled test/prediction data would potentially lead to over/under-exaggeration of a feature.
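The practice described in that answer can be sketched with scikit-learn (Azure ML Studio's saved-transformation step is the equivalent; the data here are made up):

```python
# Sketch: fit the scaler on the training data only, then apply the *same*
# fitted transform to the test data (and to any future prediction data).
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(100, 15, (500, 1))   # feature on an arbitrary scale
X_test = rng.normal(100, 15, (100, 1))

scaler = StandardScaler().fit(X_train)    # learn mean/std from train only
X_train_z = scaler.transform(X_train)
X_test_z = scaler.transform(X_test)       # reuse the train mean/std

# The training data is exactly standardized; the test data is only
# approximately standardized, because it was scaled with train statistics.
print(round(float(X_train_z.mean()), 6), round(float(X_train_z.std()), 6))
```

Fitting a second, separate scaler on the test set is the mistake to avoid: it leaks test-set statistics and produces a transform that will not match what is applied at prediction time.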

Sensitivity vs Positive Predictive Value - which is best?

I am trying to build a model on a class-imbalanced dataset (binary: 25% 1's and 75% 0's). I have tried classification algorithms and ensemble techniques. I am a bit confused about the two concepts below, as I am more interested in predicting 1's.
1. Should I give preference to Sensitivity or Positive Predictive Value?
Some ensemble techniques give at most 45% sensitivity and a low Positive Predictive Value.
Others give 62% Positive Predictive Value and low sensitivity.
2. My dataset has around 450K observations and 250 features.
After a power test I took 10K observations by simple random sampling. When selecting variable importance using ensemble techniques, the features are different from those I got when I tried with 150K observations.
Based on my intuition and domain knowledge, I felt the features that came up as important in the 150K-observation sample are more relevant. What is the best practice?
3. Last, can I use the variable importance generated by RF in other ensemble techniques to predict accuracy?
Can you please help me out, as I am a bit confused about which way to go?
The preference between Sensitivity and Positive Predictive Value depends on the ultimate goal of your analysis. The difference between these two values is nicely explained here: https://onlinecourses.science.psu.edu/stat507/node/71/
Altogether, these are two measures that look at the results from two different perspectives. Sensitivity gives you the probability that the test will find the "condition" among those who have it. Positive Predictive Value gives you the probability that someone with a positive test result actually has the "condition".
Accuracy depends on the outcome of your classification: it is defined as (true positive + true negative)/(total), not on the variable importances generated by RF.
Also, it is possible to compensate for the imbalances in the dataset, see https://stats.stackexchange.com/questions/264798/random-forest-unbalanced-dataset-for-training-test
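Both measures come from the same confusion matrix, which makes the distinction concrete. A small sketch with made-up counts that mimic the 25%/75% imbalance above:

```python
# Sketch: sensitivity (a.k.a. recall) and positive predictive value
# (a.k.a. precision) from one confusion matrix. Counts are invented.
tp, fn = 180, 70     # actual 1's: 250 (25%)
fp, tn = 110, 640    # actual 0's: 750 (75%)

sensitivity = tp / (tp + fn)            # P(predicted 1 | actually 1)
ppv = tp / (tp + fp)                    # P(actually 1 | predicted 1)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(sensitivity, 2), round(ppv, 2), round(accuracy, 2))
# prints: 0.72 0.62 0.82
```

Raising the classification threshold typically trades sensitivity for PPV and vice versa, which is why the ensembles above land at different points on that trade-off.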

Does SPSS adjust chi squared tests for weighted samples?

In SPSS, you can adjust summary statistics for stratified samples using
Weight by ..., and it allows you to then do a chi-squared test. I found a lot of examples of people doing a chi-squared like this, but nobody mentioning whether SPSS actually accounts for this in the chi square calculation.
Does standard SPSS adjust the chi squared test of independence for the stratified sample after using Weight by ...?
The R "survey" package uses the Rao-Scott correction. Similarly, SAS has a Rao-Scott chi-square test.
I'm aware of the SPSS complex surveys extension - while I'd be curious how well that one works, here I'm specifically interested in whether base SPSS does this correctly.
Thanks a lot for your help!
The answer is No. Base SPSS does not adjust standard errors for stratified samples.
All calculations in SPSS on weighted samples that use the standard error (p-values, confidence intervals, ...) will be wrong without the Complex Surveys extension.
Confirmed by IBM, UCLA, PARE and Thomas Lumley.
All sources recommended by Thomas Lumley.
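The mechanism behind the problem is that base SPSS treats the weights as frequency counts, inflating the effective sample size. A small illustration with SciPy on invented data (scaling every cell by a constant weight leaves the association identical but shrinks the p-value):

```python
# Sketch: why treating survey weights as frequency counts is dangerous.
# Multiplying all cells by a constant changes nothing about the
# association, yet the chi-squared p-value collapses.
from scipy.stats import chi2_contingency

table = [[30, 20],
         [25, 25]]
chi2_raw, p_raw, *_ = chi2_contingency(table)

# Same proportions, but every respondent now carries weight 10.
weighted = [[300, 200],
            [250, 250]]
chi2_w, p_w, *_ = chi2_contingency(weighted)

print(p_w < p_raw)  # prints: True -- inflated "n" manufactures significance
```

A design-based correction such as Rao-Scott rescales the test statistic to the true number of respondents and the design effect, which is what the Complex Surveys extension (and R's "survey" package) provides.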

How to interpret k-means output in SPSS

SPSS: K-means analysis. What criteria can I use to justify my choice of the final number of clusters? Using a hierarchical cluster analysis, I started with 2 clusters in my k-means analysis. However, after running many other k-means with different numbers of clusters, I don't know how to choose which one is better. Is there a general method of choosing the number of clusters that is scientifically sound?
Are you using SPSS Modeler or SPSS Statistics? Because I created an extension to determine the optimal number of clusters.
It is based on R: Cluster analysis in R: determine the optimal number of clusters
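The idea behind that extension can be sketched in a few lines of Python (scikit-learn on synthetic data, not the SPSS extension itself): run k-means for a range of k and pick the k with the highest average silhouette.

```python
# Sketch: choosing k by maximizing the average silhouette over a range.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
centers = [(0, 0), (10, 0), (0, 10)]   # three well-separated groups
X = np.vstack([rng.normal(c, 1.0, (80, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # prints: 3 -- the true number of generating clusters
```

Other defensible criteria include the elbow of the within-cluster sum of squares and the gap statistic; reporting the criterion used is what makes the choice scientifically justifiable.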
