Multicollinearity of categorical variables - statistics

What are the different measures available to check for multicollinearity if the data contains both categorical and continuous independent variables?
Can I use VIF by converting categorical variables into dummy variables? Is there a fundamental flaw in this, since I could not locate any reference material on the internet?

Can I use VIF by converting categorical variables into dummy variables?
Yes, you can. There is no fundamental flaw in this approach.
if the data contains both categorical and continuous independent variables?
Multicollinearity doesn’t care whether a variable is categorical or integer-valued; there is nothing special about categorical variables. Convert your categorical variables into dummy (binary) variables and treat them like all the other variables.
I assume your concern is that the dummy variables derived from a categorical variable must be correlated with each other, and it's a valid concern. Consider the case where the proportion of cases in the reference category is small. Say a categorical variable has three levels: overweight, normal, underweight. We can turn this into two indicator (dummy) variables. Then, if one category contains very few cases (say only 5 of 100 people are normal weight and the other 95 are underweight or overweight), the indicator variables will necessarily have high VIFs, even if the categorical variable is not associated with other variables in the regression model.
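As a minimal sketch of the workflow (the DataFrame and its columns here are made up for illustration), one way to compute VIFs on dummy-encoded data with pandas and statsmodels might look like this:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up data: one continuous predictor and one categorical predictor
df = pd.DataFrame({
    "age": [23, 45, 31, 52, 37, 29, 41, 60],
    "weight_class": ["overweight", "normal", "underweight", "overweight",
                     "underweight", "overweight", "normal", "underweight"],
})

# One-hot encode the categorical variable, dropping one reference category
X = pd.get_dummies(df, columns=["weight_class"], drop_first=True).astype(float)
X = sm.add_constant(X)  # VIFs are usually computed with an intercept in the design matrix

# VIF for each predictor (skip the constant itself)
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=X.columns[1:],
)
print(vifs)
```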
What are the different measures available to check for multicollinearity
One way to detect multicollinearity is to compute the correlation matrix of your data and check its eigenvalues.
Eigenvalues close to 0 indicate that the variables are correlated.
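A quick sketch of that eigenvalue check with NumPy (the data below are simulated just to show the mechanics):

```python
import numpy as np

# X: rows are observations, columns are the (dummy-encoded) predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=200)   # make one column nearly redundant

corr = np.corrcoef(X, rowvar=False)    # correlation matrix of the predictors
eigvals = np.linalg.eigvalsh(corr)     # symmetric matrix -> real, sorted eigenvalues
print(eigvals)                         # values near 0 signal multicollinearity
print("condition number:", eigvals.max() / eigvals.min())
```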


Can I use principal component analysis to reduce my data to a single dimension?

I have a dataset with 10 variables and I am looking to reduce it to a single "score". I understand the basics of PCA: it uses the covariance matrix of the variables to produce 10 eigenvectors and 10 eigenvalues. Normally, people multiply the normalized data by the eigenvectors to generate principal components, pick some number of principal components, throw them into a regression, and take the fitted value. In other words, the coefficients multiplied by the principal components allow for a data reduction to a single variable.
My question is whether I need to do the regression step (I do not have a dependent variable). Instead of doing the regression, could I use the eigenvalues as my coefficients? In other words, could I take the inner product of the vector of eigenvalues and the principal component to create a single variable?
I've never seen it used this way (and as far as I'm aware no one has even asked this question), but it seems intuitive to me. Am I missing something or is this legit and I just haven't looked in the right places? Thanks!
Do you have a target variable?
Because there are two things here,
principal component regression and
principal components analysis.
If you have a target variable (a variable you want to predict or explain), it's the regression; if not, it's the second.
PC Regression --> you need a target variable
PC Analysis --> you do not need any regression step.
I think you are in the second case, so you just want to reduce your 10 variables into one that summarizes everything.
Perhaps an interesting tutorial in Python: https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
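If all you need is that single PCA score, a minimal sklearn sketch (with simulated data standing in for your 10 variables) could be:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))              # stand-in for the 10-variable dataset

# Standardizing first means the PCA is effectively done on the correlation matrix
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=1)
score = pca.fit_transform(X_std).ravel()    # the first principal component is the single "score"

print(score[:5])
print("variance explained:", pca.explained_variance_ratio_[0])
```

No regression step is involved; the first component alone plays the role of the summary variable.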

Defining data in classification trees: ordinal vs nominal

My question is, when building a decision tree in sklearn, if I have a categorical variable, is there a problem if I manually input the values of the variable as numbers? (assuming the dataframe is small)
And will there be a difference in results if my variable is nominal or ordinal?
I don't think there should be much difference since the theory says that you should look for the best combination in terms of entropy and other metrics, so it shouldn't care if one value is smaller than another.
Thank you very much
There are differences depending on whether your categorical variable is ordinal or nominal:
If your variable is ordinal, you can just replace each category with a number (for example: bad, normal, good can become 1, 2, 3). Note that you keep only one column. You can do it manually if you have few samples, or you can use LabelEncoder from sklearn.
If your variable is not ordinal, you have to add new columns to your dataset, one for each category. You can do it manually, but I would recommend using pd.get_dummies().
To sum up, you have to be very careful about whether the categorical variable is ordinal or not. You can deal with it manually (you would obtain the same results), but it's recommended to use the predefined functions to avoid mistakes.
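A small sketch of both cases with pandas and scikit-learn (the category labels are invented; OrdinalEncoder is used here instead of LabelEncoder so the category order can be fixed explicitly):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "quality": ["bad", "good", "normal", "bad"],   # ordinal variable
    "colour":  ["red", "blue", "green", "blue"],   # nominal variable
})

# Ordinal: keep a single column, mapping categories to an explicit order
order = [["bad", "normal", "good"]]
df["quality_enc"] = OrdinalEncoder(categories=order).fit_transform(df[["quality"]]).ravel()

# Nominal: one new 0/1 column per category
df = pd.get_dummies(df, columns=["colour"])

print(df)
```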

Will fitting a linear regression and performing t-tests give similar results?

I am trying to identify the statistically significant variables out of a list of binary variables. I have a conceptual doubt about the two approaches mentioned below for finding the relevant variables.
Dependent variable:
Height of a person
Independent variables:
Gender(Male or Female)
Financial_Status(Below Poverty Line or not)
College_Graduate(Yes or No)
Approach 1: Fitting a linear regression while taking these as dependent/independent variables and finding the statistically significant variables
Approach 2: Performing an individual statistical test for each independent variable (t-test or some other relevant test) to find the statistically significant variables
Are both of these approaches similar and will give similar results? If not, what's the exact difference?
Since you have multiple independent variables, the answer is clearly no.
If you go for the t-test approach for each of the different independent variables (Gender, Financial_Status and College_Graduate), it means you'll perform 3 different tests. Performing multiple tests is risky in terms of false-positive results, and so it should be adjusted with a multiple-comparison correction method (Bonferroni, FDR, among others).
On the other hand, if you use a single multiple linear regression, you won't have to correct for multiple comparisons, which is why, in my opinion, it is the better approach.
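A hedged sketch of the two approaches side by side, using simulated data and hypothetical column names, with statsmodels and scipy:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "gender": rng.integers(0, 2, n),
    "below_poverty": rng.integers(0, 2, n),
    "college_graduate": rng.integers(0, 2, n),
})
df["height"] = 165 + 12 * df["gender"] - 2 * df["below_poverty"] + rng.normal(0, 6, n)

# Approach 1: one multiple regression; each coefficient is tested while controlling for the others
model = smf.ols("height ~ gender + below_poverty + college_graduate", data=df).fit()
print(model.summary())

# Approach 2: one two-sample t-test per binary predictor (needs a multiple-comparison correction)
for col in ["gender", "below_poverty", "college_graduate"]:
    heights_1 = df.loc[df[col] == 1, "height"]
    heights_0 = df.loc[df[col] == 0, "height"]
    print(col, stats.ttest_ind(heights_1, heights_0, equal_var=False))
```

The p-values will generally differ between the two approaches, because the regression coefficients are adjusted for the other predictors while the individual t-tests are not.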

spark - MLlib: transform and manage categorical features

Consider big datasets with 2 billion+ samples and approximately 100+ features per sample. Among these, 10% of the features are numerical/continuous variables and the rest are categorical variables (position, languages, URL, etc.).
Let's use some examples:
e.g: dummy categorical feature
feature: Position
real values: SUD | CENTRE | NORTH
encoded values: 1 | 2 | 3
...it would make sense to use a reduction like SVD because the distance between SUD and NORTH is greater than between SUD and CENTRE, and, moreover, it's possible to encode this variable (e.g. with OneHotEncoder or StringIndexer) because of the small cardinality of its value set.
e.g: real categorical feature
feature: url
real values: very high cardinality
encoded values: ?????
1) In MLlib, 90% of the models work only with numerical values (apart from Frequent Itemset and DecisionTree techniques)
2) Feature transformers/reducers/extractors such as PCA or SVD are not good for this kind of data, and there is no implementation of, e.g., MCA
a) What would be your approach to handle this kind of data in Spark, using MLlib?
b) Do you have any suggestions for coping with so many categorical values?
c) After reading a lot of the literature, and considering the models implemented in Spark, my idea is that, to make inference on one of those features using the other (categorical) ones, the models at point 1 could be the best choice. What do you think about it?
(To standardize on a classical use case, you can imagine the problem of inferring the gender of a person from visited URLs and other categorical features.)
Given that I am a newbie with regard to MLlib, may I ask you to provide a concrete example?
Thanks in advance
Well, first I would say Stack Overflow works in a different way: you should be the one providing a working example of the problem you are facing, and we help you out using that example.
Anyway, I got intrigued by the use of categorical values like the one you show as position. If this is, as you mention, a categorical variable with 3 levels (SUD, CENTRE, NORTH), there is no distance between them if they are truly categorical. In that case I would create dummy variables like:
SUD_Cat CENTRE_Cat NORTH_Cat
SUD 1 0 0
CENTRE 0 1 0
NORTH 0 0 1
This is a truly dummy representation of a categorical variable.
On the other hand, if you want to take that distance into account, then you have to create another feature which encodes the distance explicitly, but that is not a dummy representation.
If the problem you are facing is that, after you write your categorical features as dummy variables (note that now all of them are numerical), you have very many features and want to reduce your feature space, then that is a different problem.
As a rule of thumb, I try to utilize the entire feature space first; a plus in Spark is that the computing power lets you run modelling tasks on big datasets. If it is still too big, then I would go for dimensionality reduction techniques, PCA, etc.
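For point (a), a minimal PySpark sketch of the usual StringIndexer + OneHotEncoder + VectorAssembler pipeline for low-cardinality categoricals (the column names and toy rows are assumptions; very-high-cardinality features such as URLs would need something like hashing or frequency bucketing before this step):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

spark = SparkSession.builder.appName("categorical-demo").getOrCreate()

df = spark.createDataFrame(
    [("SUD", 1.2), ("CENTRE", 0.7), ("NORTH", 3.1), ("SUD", 2.4)],
    ["position", "numeric_feature"],
)

indexer = StringIndexer(inputCol="position", outputCol="position_idx")
encoder = OneHotEncoder(inputCols=["position_idx"], outputCols=["position_vec"])
assembler = VectorAssembler(
    inputCols=["position_vec", "numeric_feature"], outputCol="features"
)

pipeline = Pipeline(stages=[indexer, encoder, assembler])
result = pipeline.fit(df).transform(df)
result.select("position", "features").show(truncate=False)
```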

In SAS is there a way to compute the "canonical correlation" between two categorical variables using just a proc?

I have two character variables in a dataset and I want to compute the canonical correlation between them. By that I mean I want to create some dummy variables from the two categorical variables and compute the canonical correlation that way. After looking through proc cancorr, I cannot find a way to do this without first manually converting the categorical variables to dummy variables. Is there a way to do it without that manual conversion step?
No need to do this manually. There are a few SAS macros which can do this for you. Here is one:
http://www.datavis.ca/sasmac/dummy.html
PROC GLMMOD will create a design matrix for you, which essentially means creating the dummy variables.
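Outside of SAS, just to illustrate what the dummy-coding step buys you, here is a hedged Python sketch of the same idea with scikit-learn's CCA (the two categorical variables here are simulated):

```python
import numpy as np
import pandas as pd
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
a = rng.choice(["low", "mid", "high"], size=300)
# make b partially dependent on a so the canonical correlation is non-trivial
b = np.where(rng.random(300) < 0.7, a, rng.choice(["low", "mid", "high"], size=300))

# Dummy-code both categorical variables (drop one level each to avoid redundancy)
X = pd.get_dummies(pd.Series(a), drop_first=True).to_numpy(dtype=float)
Y = pd.get_dummies(pd.Series(b), drop_first=True).to_numpy(dtype=float)

cca = CCA(n_components=min(X.shape[1], Y.shape[1]))
X_c, Y_c = cca.fit_transform(X, Y)

# Canonical correlations are the correlations between paired canonical variates
canon_corrs = [np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1] for i in range(X_c.shape[1])]
print(canon_corrs)
```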
