How does VectorSlicer work in Spark 2.0? - apache-spark

In the Spark official documentation,
VectorSlicer is a transformer that takes a feature vector and outputs a new feature vector with a sub-array of the original features. It is useful for extracting features from a vector column.
Does this select the important features from the set of features?
If that is the case how is it done without the mention of a dependent variable?
I am trying to perform data clustering and I need the important features which will contribute to the clusters better. Can I use VectorSlicer for this?

Does this select the important features from the set of features?
It doesn't. It literally slices the vector to select only specified indices.
and need the important features which will contribute to the clusters better.
If you have categorical data consider using ChiSqSelector.
Otherwise you can use dimensionality reduction like PCA. It won't be the same as feature selection but should provide similar benefits (keep only the most important signals, discard the rest).

Related

how can I simplify BoWs?

I'm trying to apply some binary text classification but I don't feel that having millions of >1k length vectors is a good idea. So, which alternatives are there for the basic BOW model?
I think there are quite a few different approaches, based on what exactly you are aiming for in your prediction task (processing speed over accuracy, variance in your text data distribution, etc.).
Without any further information on your current implementation, I think the following avenues offer ways for improvement in your approach:
Using sparse data representations. This might be a very obvious point, but choosing the right data structure to represent your input vectors can already save you a great deal of pain. Sklearn offers a variety of options, and detail them in their great user guide. Specifically, I would point out that you could either use scipy.sparse matrices, or alternatively represent something with sklearn's DictVectorizer.
Limit your vocabulary. There might be some words that you can easily ignore when building your BoW representation. I'm again assuming that you're working with some implementation similar to sklearn's CountVectorizer, which already offers a great number of possibilities. The most obvious option are stopwords, which can simply be dropped from your vocabulary entirely, but of course you can also limit it further by using pre-processing steps such as lemmatization/stemming, lowercasing, etc. CountVectorizer specifically also allows you to control the minimum and maximum document frequency (don't confuse this with corpus frequency), which again should limit the size of your vocabulary.

Mixed variable multi-factor analysis (MFA) in caret: Is it possible? Can I employ any FactoMineR::FAMD output?

I have a dataset with mixed categorical and numerical variables that I would like to reduce to principle components and then feed into caret::train for modeling. I am trying to figure out if this is even possible. I see that FactoMineR::FAMD provides mixed data factorial analysis but I am not sure how to use this information to transform my training and testing datasets to use in caret.
If this is not possible, does anyone know of an alternative method for addressing the issue of the reduction of mixed data?

How to see correlation between features in scikit-learn?

I am developing a model in which it predicts whether the employee retains its job or leave the company.
The features are as below
satisfaction_level
last_evaluation
number_projects
average_monthly_hours
time_spend_company
work_accident
promotion_last_5years
Department
salary
left (boolean)
During feature analysis, I came up with the two approaches and in both of them, I got different results for the features. as shown in the image
here
When I plot a heatmap it can be seen that satisfaction_level has a negative correlation with left.
On the other hand, if I just use pandas for analysis I got results something like this
In the above image, it can be seen that satisfaction_level is quite important in the analysis since employees with higher satisfaction_level retain the job.
While in the case of time_spend_company the heatmap shows it is important while on the other hand, the difference is not quite important in the second image.
Now I am confused about whether to take this as one of my features or not and which approach should I choose in order to choose features.
Some please help me with this.
BTW I am doing ML in scikit-learn and the data is taken from here.
Correlation between features have little to do with feature importance. Your heat map is correctly showing correlation.
In fact, in most of the cases when you talking about feature importance, you must provide context of a model that you are using. Different models may choose different features as important. Moreover many models assume that data comes from IID (Independent and identically distributed random variables), so correlation close to zero is desirable.
For example in sklearn learn regression to get estimation of feature importance you can examine coef_ parameter.

Using SVM to perform classification on multi-dimensional time series datasets

I would like to use scikit-learn's svm.SVC() estimator to perform classification tasks on multi-dimensional time series - that is, on time series where the points in the series take values in R^d, where d > 1.
The issue with doing this is that svm.SVC() will only take ndarray objects of dimension at most 2, whereas the dimension of such a dataset would be 3. Specifically, the shape of a given dataset would be (n_samples, n_features, d).
Is there a workaround available? One simple solution would just be to reshape the dataset so that it is 2-dimensional, however I imagine this would lead to the classifier not learning from the dataset properly.
Without any further knowledge about the data reshaping is the best you can do. Feature engineering is a very manual art that depends heavily on domain knowledge.
As a rule of thumb: if you don't really know anything about the data throw in the raw data and see if it works. If you have an idea what properties of the data may be beneficial for classification, try to work it in a feature.
Say we want to classify swiping patterns on a touch screen. This closely resembles your data: We acquired many time series of such patterns by recording the 2D position every few milliseconds.
In the raw data, each time series is characterized by n_timepoints * 2 features. We can use that directly for classification. If we have additional knowledge we can use that to create additional/alternative features.
Let's assume we want to distinguish between zig-zag and wavy patterns. In that case smoothness (however that is defined) may be a very informative feature that we can add as a further column to the raw data.
On the other hand, if we want to distinguish between slow and fast patterns, the instantaneous velocity may be a good feature. However, the velocity can be computed as a simple difference along the time axis. Even linear classifiers can model this easily so it may turn out that such features, although good in principle, do not improve classification of raw data.
If you have lots and lots and lots and lots of data (say an internet full of good examples) Deep Learning neural networks can automatically learn features to some extent, but let's say this is rather advanced. In the end, most practical applications come down to try and error. See what features you can come up with and try them out in practice. And beware the overfitting gremlin.

Is it possible to compare the classification ability of two sets of features by ROC?

I am learning about SVM and ROC. As I know, people can usually use a ROC(receiver operating characteristic) curve to show classification ability of a SVM (Support Vector Machine). I am wondering if I can use the same concept to compare two subsets of features.
Assume I have two subsets of features, subset A and subset B. They are chosen from the same train data by 2 different features extraction methods, A and B. If I use these two subsets of features to train the same SVM by using the LIBSVM svmtrain() function and plot the ROC curves for both of them, can I compare their classification ability by their AUC values ? So if I have a higher AUC value for subsetA than subsetB, can I conclude that method A is better than method B ? Does it make any sense ?
Thank you very much,
Yes, you are on the right track. However, you need to keep a few things in mind.
Often using the two features A and B with the appropriate scaling/normalization can give better performance than the features individually. So you might also consider the possibility of using both the features A and B together.
When training SVMs using features A and B, you should optimize for them separately, i.e. compare the best performance obtained using feature A with the best obtained using feature B. Often the features A and B might give their best performance with different kernels and parameter settings.
There are other metrics apart from AUC, such as F1-score, Mean Average Precision(MAP) that can be computed once you have evaluated the test data and depending on the application that you have in mind, they might be more suitable.

Resources