Mixed variable multi-factor analysis (MFA) in caret: Is it possible? Can I employ any FactoMineR::FAMD output?

I have a dataset with mixed categorical and numerical variables that I would like to reduce to principal components and then feed into caret::train for modeling. I am trying to figure out whether this is even possible. I see that FactoMineR::FAMD performs factor analysis of mixed data, but I am not sure how to use its output to transform my training and testing datasets for use in caret.
If this is not possible, does anyone know of an alternative method for reducing mixed data?

Related

How can I simplify BoWs?

I'm trying to apply some binary text classification, but I don't feel that having millions of >1k-length vectors is a good idea. So, what alternatives are there to the basic BoW model?
I think there are quite a few different approaches, based on what exactly you are aiming for in your prediction task (processing speed over accuracy, variance in your text data distribution, etc.).
Without any further information on your current implementation, I think the following avenues offer ways for improvement in your approach:
Use sparse data representations. This might be a very obvious point, but choosing the right data structure to represent your input vectors can already save you a great deal of pain. Sklearn offers a variety of options and details them in its great user guide. Specifically, you could either use scipy.sparse matrices or represent your features with sklearn's DictVectorizer.
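As a minimal sketch of the first point (the tiny per-document count dicts here are made up for illustration), DictVectorizer produces a scipy.sparse matrix by default, so you never materialize the dense vectors:

```python
# Minimal sketch: sparse count vectors via sklearn's DictVectorizer.
# The per-document token counts below are a made-up toy corpus.
from sklearn.feature_extraction import DictVectorizer

docs = [
    {"cheap": 2, "pills": 1, "now": 1},
    {"meeting": 1, "tomorrow": 1, "now": 1},
]

vec = DictVectorizer()      # returns a scipy.sparse matrix by default
X = vec.fit_transform(docs)

print(type(X))              # a scipy.sparse matrix, not a dense ndarray
print(X.shape)              # (2, number of distinct tokens seen)
```

Only the non-zero entries are stored, which is what makes millions of long, mostly-zero vectors tractable.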
Limit your vocabulary. There might be some words that you can easily ignore when building your BoW representation. I'm again assuming that you're working with an implementation similar to sklearn's CountVectorizer, which already offers a great number of possibilities. The most obvious option is stopwords, which can simply be dropped from your vocabulary entirely, but you can also limit it further with pre-processing steps such as lemmatization/stemming, lowercasing, etc. CountVectorizer specifically also allows you to control the minimum and maximum document frequency (not to be confused with corpus frequency), which again should limit the size of your vocabulary.
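The second point can be sketched with CountVectorizer on a toy corpus (the thresholds below are illustrative, not recommendations):

```python
# Sketch of vocabulary limiting with sklearn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

vec = CountVectorizer(
    stop_words="english",  # drop common function words entirely
    min_df=2,              # keep only words in at least 2 documents
    lowercase=True,
)
X = vec.fit_transform(corpus)
print(sorted(vec.vocabulary_))  # ['cat', 'dog', 'sat']
```

Stopwords like "the" and "on" are removed by the built-in English list, and min_df then discards the words that appear in only one document, shrinking the vocabulary from eight content words to three.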

How to see correlation between features in scikit-learn?

I am developing a model that predicts whether an employee keeps their job or leaves the company.
The features are as follows:
satisfaction_level
last_evaluation
number_projects
average_monthly_hours
time_spend_company
work_accident
promotion_last_5years
Department
salary
left (boolean)
During feature analysis, I came up with two approaches, and they gave me different results for the features, as shown in the image here.
When I plot a heatmap it can be seen that satisfaction_level has a negative correlation with left.
On the other hand, if I just use pandas for analysis I got results something like this
In the above image, it can be seen that satisfaction_level is quite important in the analysis since employees with higher satisfaction_level retain the job.
In the case of time_spend_company, however, the heatmap shows it is important, while the difference in the second image is not very pronounced.
Now I am confused about whether to take this as one of my features, and about which approach I should use to select features.
Could someone please help me with this?
BTW I am doing ML in scikit-learn and the data is taken from here.
Correlation between features has little to do with feature importance. Your heat map is correctly showing correlation.
In fact, in most cases when you talk about feature importance, you must provide the context of the model you are using. Different models may choose different features as important. Moreover, many models assume that the data is IID (independent and identically distributed), so correlation close to zero between features is desirable.
For example, in sklearn's linear regression, you can examine the coef_ attribute to get an estimate of feature importance.

How does VectorSlicer work in Spark 2.0?

In the Spark official documentation,
VectorSlicer is a transformer that takes a feature vector and outputs a new feature vector with a sub-array of the original features. It is useful for extracting features from a vector column.
Does this select the important features from the set of features?
If that is the case how is it done without the mention of a dependent variable?
I am trying to perform data clustering, and I need the important features that contribute most to the clusters. Can I use VectorSlicer for this?
Does this select the important features from the set of features?
It doesn't. It literally slices the vector to select only specified indices.
I need the important features that contribute most to the clusters.
If you have categorical data consider using ChiSqSelector.
Otherwise you can use dimensionality reduction like PCA. It won't be the same as feature selection but should provide similar benefits (keep only the most important signals, discard the rest).
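The difference can be sketched in a scikit-learn/NumPy analogue rather than Spark's API (random data, arbitrary column choices): slicing just keeps fixed indices, while PCA projects onto the highest-variance directions.

```python
# What VectorSlicer does conceptually, vs. PCA, in a NumPy/sklearn
# analogue of the Spark transformers (toy random data).
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 5))

# "Slicing": keep only columns 0 and 3; no notion of importance involved
sliced = X[:, [0, 3]]
print(sliced.shape)  # (100, 2)

# PCA: project onto the directions of highest variance instead
reduced = PCA(n_components=2).fit_transform(X)
print(reduced.shape)  # (100, 2)
```

Both outputs have two columns, but only the PCA projection is chosen by the data; the slice is whatever indices you hard-coded.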

Suitable data mining technique for this dataset

I'm working on a data mining project and would like to mine this dataset Higher Education Enrolments for interesting patterns or knowledge. My problem is figuring out which technique would work best for the dataset.
I'm currently working on the dataset using RapidMiner 5.0, and I removed two columns (E550 - Reference year, E931 - Total Student EFTSL) from the data as they are not relevant to the analysis. The rest of the attributes are nominal, except StudentID (integer), which I have used as my id. I'm currently using classification (Naive Bayes) but would like to get the opinions of others, hopefully those with more experience in this area. Thanks.
The best technique depends on many factors: the type/distribution of the training and target attributes, the domain, the value ranges of the attributes, etc. The best technique to use is the result of data analysis and understanding.
In this particular case, you should clarify which is the attribute to predict.
Unless you already know what you are looking for, and know about the quality of the data source, you should always start by trying various exploratory analysis:
- look at some of the first- and second-order statistics of all the variables
- generate histograms of each variable, to get an idea of the empirical distribution of each
- take a look at pairwise scatter plots of variables that might have dependencies
- try any other visualizations you might think of
These would give you a rough idea of what kinds of patterns might be present and might be discoverable given the noise level. Then, depending on what kind of pattern you are interested in, you could start trying various unsupervised pattern-learning methods, such as PCA/ICA/factor analysis or clustering, or supervised methods, such as regression or classification.
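The exploratory steps above can be sketched in a few lines of pandas (the DataFrame here is made up, since the enrolment data isn't at hand):

```python
# Sketch of basic exploratory analysis on a made-up DataFrame:
# summary statistics and pairwise linear dependencies per column.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 60, 200),
    "score": rng.normal(70, 10, 200),
})

print(df.describe())  # first/second-order statistics of each variable
print(df.corr())      # pairwise linear dependencies

# Histograms and scatter plots, if a plotting backend is available:
# df.hist(); pd.plotting.scatter_matrix(df)
```

Nominal attributes like the ones in this dataset would instead get value_counts() per column and cross-tabulations (pd.crosstab) in place of correlations.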

Is it possible to compare the classification ability of two sets of features by ROC?

I am learning about SVMs and ROC. As I understand it, people usually use an ROC (receiver operating characteristic) curve to show the classification ability of an SVM (support vector machine). I am wondering if I can use the same concept to compare two subsets of features.
Assume I have two subsets of features, subset A and subset B. They are chosen from the same training data by two different feature extraction methods, A and B. If I use these two subsets to train the same SVM with the LIBSVM svmtrain() function and plot ROC curves for both of them, can I compare their classification ability by their AUC values? If subset A has a higher AUC value than subset B, can I conclude that method A is better than method B? Does that make sense?
Thank you very much,
Yes, you are on the right track. However, you need to keep a few things in mind.
Often, using feature sets A and B together, with appropriate scaling/normalization, can give better performance than either set individually. So you might also consider using both A and B.
When training SVMs on feature sets A and B, you should optimize each separately, i.e. compare the best performance obtained using set A with the best obtained using set B. The two sets might achieve their best performance with different kernels and parameter settings.
There are other metrics apart from AUC, such as the F1-score and mean average precision (MAP), that can be computed once you have evaluated the test data; depending on the application you have in mind, they might be more suitable.
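The comparison itself can be sketched as follows, using sklearn's SVC in place of LIBSVM's svmtrain() and a toy dataset; the split of the columns into subsets A and B is arbitrary here, standing in for the two extraction methods:

```python
# Sketch: comparing two feature subsets by ROC AUC with the same SVM.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

aucs = {}
for name, cols in {"A": range(0, 5), "B": range(5, 10)}.items():
    cols = list(cols)
    svm = SVC(probability=True, random_state=0).fit(X_tr[:, cols], y_tr)
    scores = svm.predict_proba(X_te[:, cols])[:, 1]
    aucs[name] = roc_auc_score(y_te, scores)

print(aucs)  # the subset with the higher AUC won, on this split
```

Per the answer's second point, a fairer version would wrap each subset's model in a hyperparameter search (e.g. over kernel and C) and use cross-validation, so that each method is judged at its own best settings rather than on a single split.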
