What is n_components in scikit-learn CCA?

I am new to Canonical Correlation Analysis. Going through the sklearn CCA documentation, I came across the n_components field, but I could not find any conceptual description of what the variable is for. What exactly is this variable, and how do I decide what value to set it to?

The n_components variable is the number of components you want to keep in your CCA model.
You can think of it as the number of dimensions that you want to represent your data in.
There are a few ways to decide how many components to keep in your model.
One way is to look at the canonical correlations: each canonical correlation measures the correlation between a pair of canonical variates computed from the two sets of variables being transformed by the CCA.
If you only want to keep the dimensions with the strongest relationships, set n_components to the number of components whose canonical correlations are high.
There is no right or wrong answer for how many components to keep in your CCA model.
It really depends on what you are trying to achieve with your CCA model.
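As a minimal sketch (with made-up data, since the original data is not shown), you can fit a CCA and look at the correlation between each pair of canonical variates to guide the choice of n_components:

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))   # first set of variables
Y = rng.normal(size=(100, 4))   # second set of variables

cca = CCA(n_components=3)              # keep 3 pairs of canonical variates
X_c, Y_c = cca.fit_transform(X, Y)     # scores for each set, shape (100, 3)

# canonical correlation of each component: correlation between the paired variates
canonical_corrs = [np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1] for i in range(X_c.shape[1])]
print(canonical_corrs)

If the later components have correlations close to zero, a smaller n_components is usually enough.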

Related

Normality Assumption - how to check you have not violated it?

I am relatively new to statistics and am struggling with the normality assumption.
I understand that parametric tests are underpinned by the assumption that the data are normally distributed, but there seem to be lots of papers and articles providing conflicting information.
Some articles say that independent variables need to be normally distributed, and this may require a transformation (log, SQRT, etc.). Others say that in linear modelling there are no assumptions about the distribution of the independent variables.
I am trying to create a multiple regression model to predict highest pain scores on hospital admissions:
DV: numeric pain score (0 = no pain to 5 = intense pain) (discrete dependent variable).
IVs: age (continuous), weight (continuous), sex (nominal), deprivation status (ordinal), race (nominal).
Can someone help clear up the following for me?
Before fitting a model, do I need to check whether my independent variables are normally distributed? If so, why? Does this only apply to continuous variables (e.g. age and weight in my model)?
If age is positively skewed, would a transformation (e.g. log, SQRT) be appropriate, and why? Is it best to do this before or after fitting a model? I assume I am trying to get close to a linear relationship between my DV and IVs.
As part of the SPSS output, it provides plots of the standardised residuals against predicted values and also normal P-P plots of the standardised residuals. Are these checks all that is needed to verify the normality assumption after fitting a model?
Many Thanks in advance!
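For what it's worth, the residual diagnostics mentioned above (residuals against predicted values, and a normal probability plot of the residuals) are not SPSS-specific; here is a rough sketch in Python, where the DataFrame df and the column names are hypothetical:

import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.formula.api as smf

# df is assumed to hold the admission data; column names are made up
model = smf.ols("pain_score ~ age + weight + C(sex) + C(deprivation) + C(race)", data=df).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(model.fittedvalues, model.resid)   # residuals vs predicted values
axes[0].set_xlabel("Predicted value")
axes[0].set_ylabel("Residual")
stats.probplot(model.resid, plot=axes[1])          # normal Q-Q plot of the residuals (analogous to a P-P plot)
plt.show()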

Is there a built-in sklearn cross validation function for only validating against higher indices?

I'd like to use GridSearchCV, but with the condition that the lowest index of the data in the validation set be greater than the largest index in the training set. The reason is that the data is ordered in time, and future data gives unfair insight that would inflate the score. There's some discussion on this:
If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.
but it's not clear to me whether any of the splitting methods listed can accomplish what I'm looking for. It seems to be the case that I can define an iterable of indices and pass that into cv, but in that case it's not clear how many splits I should define (does it always use all of them? do different tests get different indices?)
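As the question notes, cv does accept an iterable of (train_indices, test_indices) pairs, and every parameter setting is scored on each of the splits you supply (the scores are then averaged). A rough sketch with made-up data and a hypothetical estimator, where every validation index is strictly greater than every training index:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))     # rows assumed ordered by time
y = rng.normal(size=100)

splits = []
for cutoff in (60, 70, 80, 90):   # hypothetical cutoffs
    train_idx = np.arange(0, cutoff)            # strictly earlier rows
    valid_idx = np.arange(cutoff, cutoff + 10)  # strictly later rows
    splits.append((train_idx, valid_idx))

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=splits)
search.fit(X, y)   # each candidate is evaluated on all four splits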

Tuned model with GroupKFold Cross-Validation requires Group parameter when Predicting

I tuned a RandomForest with GroupKFold (to prevent data leakage because some rows came from the same group).
I get a best fit model, but when I go to make a prediction on the test data it says that it needs the group feature.
Does that make sense? It's odd that the group feature is coming up as one of the most important features as well.
I'm just wondering if there is something I could be doing wrong.
Thanks
A search on the scikit-learn Github repo does not reveal a single instance of the string "group feature" or "group_feature" or anything similar, so I will go ahead and assume you have in your data set a feature called "group" that the prediction model requires as input in order to produce an output.
Remember that a prediction model is basically a function that takes an input (the "predictor" variable) and returns an output (the "predicted" variable). If a variable called "group" was defined as input for your prediction model, then it makes sense that scikit-learn would request it.
Does the group appear as a column in the training set? If so, remove it and re-train. It looks like you are just using it to generate the splits. If it isn't part of the input data you need in order to predict, it shouldn't be in the training set.
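A minimal sketch (with made-up data) of the pattern described above: the group identifier is passed to fit through the groups argument and never appears in X, so prediction only needs the real features.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))            # real features only -- no group column
y = rng.randint(0, 2, size=200)
groups = rng.randint(0, 20, size=200)    # group ids kept separate from X

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"n_estimators": [50, 100]},
                      cv=GroupKFold(n_splits=5))
search.fit(X, y, groups=groups)          # groups steer the splits but are never a feature

X_new = rng.normal(size=(10, 5))
print(search.predict(X_new))             # predicting needs only the 5 real features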

What's the selection criterion for the "best" model in "regsubsets" & how do I access several "best" models?

I've been playing around with the regsubsets function a bit, using the "forward" method to select variables for a linear regression model. However, despite also reading the documentation, I can't seem to figure out how the leaps.setup underlying this function determines the "best" model for each separate number of variables in a model.
Say I have a model with 10 potential variables in it (and nvmax = 10): I get exactly one "best" model for a model with 1 var, 2 vars, etc. But how is this model selected by the function? I wonder particularly because, after having run this function, I'm able to extract the best model among models of different(!) sizes by specifying a criterion (e.g., adjr2).
Related to this, I wonder: If I set, for example, nbest = 5 I understand that the function calculates the five best models for each model size (i.e., for a model with ten variables it gives five different variations that perform better than the rest). If I understand that correctly, is there any way to extract these five models for a specific model size? That is, for example, display the coefficients of these five best models?
I hope I'm being clear about my problems here... Please let me know if example data or any further information would help to clarify the issue!
The "best" model picked by regsubsets is the one that minimizes the sum of the squares of the residuals.
I'm still working on the second question...
Addressing the second question: the following code displays the coefficients of the 5 best models for each number of explanatory variables, from 1 to 3 variables. Y is the response variable of the models.
library(leaps)
# nvmax = 3: models with up to 3 variables; nbest = 5: keep the 5 best models of each size
best_models = regsubsets(Y ~ ., data = data_set, nvmax = 3, nbest = 5)
# 3 sizes x 5 models each = 15 models; print the coefficients of all of them
coef(best_models, 1:15)

How to see correlation between features in scikit-learn?

I am developing a model that predicts whether an employee retains their job or leaves the company.
The features are as below
satisfaction_level
last_evaluation
number_projects
average_monthly_hours
time_spend_company
work_accident
promotion_last_5years
Department
salary
left (boolean)
During feature analysis, I came up with two approaches, and I got different results for the features in each of them, as shown in the image here.
When I plot a heatmap it can be seen that satisfaction_level has a negative correlation with left.
On the other hand, if I just use pandas for the analysis, I get results something like this:
In the above image, it can be seen that satisfaction_level is quite important in the analysis since employees with higher satisfaction_level retain the job.
In the case of time_spend_company, however, the heatmap shows it as important, while the difference in the second image does not look substantial.
Now I am confused about whether to keep this as one of my features, and about which approach I should use to select features.
Can someone please help me with this?
BTW I am doing ML in scikit-learn and the data is taken from here.
Correlation between features has little to do with feature importance. Your heat map is correctly showing correlation.
In fact, in most cases, when you talk about feature importance you have to provide the context of the model you are using. Different models may choose different features as important. Moreover, many models assume the data are IID (independent and identically distributed random variables), so correlation between features close to zero is desirable.
For example, with linear regression in scikit-learn, you can examine the coef_ attribute to get an estimate of feature importance.
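To make the distinction concrete, here is a rough sketch that assumes the HR data set has already been loaded into a pandas DataFrame called df (column names as listed in the question): the correlation with left is a property of the data alone, while coef_ only exists in the context of a fitted model.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# df = pd.read_csv(...)  # assumed: the HR data from the linked source

print(df.corr(numeric_only=True)["left"])        # data property: satisfaction_level comes out negative

X = pd.get_dummies(df.drop(columns=["left"]))    # one-hot encode Department and salary
y = df["left"]

model = LogisticRegression(max_iter=1000).fit(StandardScaler().fit_transform(X), y)
print(pd.Series(model.coef_[0], index=X.columns).sort_values())   # model property: one coefficient per feature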
