CV=integer vs predefined splits in GridSearchCV - python-3.x

What's the difference between setting cv=some integer vs cv=PredefinedSplit(test_fold=your_test_fold)?
Is there any advantage of one over the other? Does cv=some integer set the splits randomly?

Specifying an integer will produce k-fold cross-validation without shuffling, as described in the documentation for sklearn.model_selection.KFold. Shuffling before splitting may or may not be preferable: if your data is sorted, shuffling is necessary to randomize the distribution of samples, whereas if the samples are correlated due to spatial or temporal sampling effects, shuffling may give an optimistic view of performance.
I would avoid using PredefinedSplit unless you have a very good reason to predefine your splits. There are other CV generators that can probably meet your needs, like StratifiedKFold if you want to maintain your class distribution, for example.
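For illustration, a minimal sketch (synthetic data and an arbitrary parameter grid, not from the original question) of how an integer cv compares with passing an explicit, shuffling CV generator:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=200, random_state=0)
param_grid = {"C": [0.1, 1.0, 10.0]}

# cv=5: unshuffled folds (KFold or StratifiedKFold, depending on estimator/target)
gs_int = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)

# Explicit generator: shuffle before splitting, with a fixed seed for reproducibility
gs_shuffled = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)

gs_int.fit(X, y)
gs_shuffled.fit(X, y)
print(gs_int.best_params_, gs_shuffled.best_params_)
```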

Related

how can I simplify BoWs?

I'm trying to apply some binary text classification but I don't feel that having millions of >1k length vectors is a good idea. So, which alternatives are there for the basic BOW model?
I think there are quite a few different approaches, based on what exactly you are aiming for in your prediction task (processing speed over accuracy, variance in your text data distribution, etc.).
Without any further information on your current implementation, I think the following avenues offer ways for improvement in your approach:
Using sparse data representations. This might be a very obvious point, but choosing the right data structure to represent your input vectors can already save you a great deal of pain. Sklearn offers a variety of options and details them in its user guide. Specifically, you could either use scipy.sparse matrices directly, or build your representation with sklearn's DictVectorizer.
Limit your vocabulary. There may be words you can safely ignore when building your BoW representation. I'm again assuming that you're working with something similar to sklearn's CountVectorizer, which already offers a great number of possibilities. The most obvious candidates are stopwords, which can simply be dropped from the vocabulary entirely, but you can also limit it further with pre-processing steps such as lemmatization/stemming, lowercasing, etc. CountVectorizer specifically also lets you control the minimum and maximum document frequency (don't confuse this with corpus frequency), which again should limit the size of your vocabulary.
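A small sketch of the vocabulary-limiting options mentioned above (toy corpus and arbitrary thresholds of my own choosing); note that the output is already a scipy.sparse matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for your documents
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are common pets",
]

# Drop English stopwords and prune the vocabulary by document frequency
vectorizer = CountVectorizer(
    stop_words="english",
    min_df=1,      # ignore terms appearing in fewer than min_df documents
    max_df=0.95,   # ignore terms appearing in more than 95% of documents
)
X = vectorizer.fit_transform(docs)

# The result is a scipy.sparse matrix, not a dense array
print(type(X), X.shape)
print(vectorizer.get_feature_names_out())
```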

how to choose the best vector_size for doc2vec?

I am comparing techniques and want to find out the best method to vectorize a large number of text documents and reduce their dimensionality. I have already tested Bag of Words and TF-IDF and reduced dimensions with PCA, SVD, and NMF. With these approaches I can reduce my data and pick the best number of dimensions based on the variance explained.
However, I want to do the same with doc2vec. Considering that doc2vec itself is a dimensionality reducer, what is the best approach to find the right number of dimensions for my model? Is there any statistical measure that helps me find the best vector_size?
Thanks in advance!
There's no magic indicator for what's best; you should try a range of dimensionalities to see what scores well on your specific downstream evaluations, given your data & goals.
If you're using a doc2vec implementation that offers inference of out-of-training-set documents (such as the .infer_vector() method in the Python gensim library), then a plausible sanity check for eliminating very bad choices of vector_size (or other parameters) is to re-infer vectors for training-set documents.
If repeated re-inferences of the same text are generally close to each other, and to the vector created for that same document during full model training, that's a weak indicator that the model is at least behaving in a self-consistent way. (If the spread of results is large, that might indicate problems such as insufficient data, too few training epochs, a too-large/overfit model, or other foundational issues.)
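A minimal sketch of that re-inference check with gensim, assuming gensim 4.x (where trained document vectors live in model.dv); the corpus and parameter values here are placeholders:

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Hypothetical corpus; in practice use your own documents
raw_docs = ["some example text here", "another training document", "more text of this kind"] * 50
corpus = [TaggedDocument(simple_preprocess(d), [i]) for i, d in enumerate(raw_docs)]

model = Doc2Vec(corpus, vector_size=50, epochs=40, min_count=2)

# Re-infer a training document several times and compare to its trained vector
doc_id = 0
trained = model.dv[doc_id]
reinferred = [model.infer_vector(corpus[doc_id].words) for _ in range(5)]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# High, stable similarities suggest a self-consistent model; a wide spread suggests trouble
print([round(cosine(trained, v), 3) for v in reinferred])
```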

when to use MinMaxScaler and StandardScaler

When should MinMaxScaler be used, and when StandardScaler?
I think it depends on the data. Are there any characteristics of the data to look at when deciding which preprocessing method to use?
I looked at the docs, but can someone give me more insight into this?
The scaling will indeed depend on the type of data that you have. For most cases, StandardScaler is the scaler of choice. If you know that you have some outliers, go for the RobustScaler.
However, when you deal with features that have an unusual distribution, such as the digits dataset, these scalers will not be the best choice. On that dataset a lot of pixels are zero, so the distribution has a peak at zero, and dividing by the standard deviation will not be beneficial. Basically, when the distribution of a feature is far from normal, you need an alternative.
In the case of the digits, MinMaxScaler is a much better choice. However, if you want to keep the zeros at zero (because you use sparse matrices), go for MaxAbsScaler.
NB: also look at QuantileTransformer and PowerTransformer if you want a feature to follow a normal/uniform distribution regardless of the original distribution.
I hope this helps.
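A small sketch comparing how the scalers mentioned above treat the same column (synthetic one-feature data constructed to mimic the zero-heavy "peak at zero" case; parameter values are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, RobustScaler, MinMaxScaler, MaxAbsScaler, QuantileTransformer
)

rng = np.random.default_rng(0)
# Hypothetical feature: mostly zeros plus a few large positive values
X = np.concatenate([np.zeros(900), rng.exponential(50.0, 100)]).reshape(-1, 1)

scalers = {
    "standard": StandardScaler(),    # zero mean, unit variance
    "robust": RobustScaler(),        # centers on median, scales by IQR
    "minmax": MinMaxScaler(),        # maps the column into [0, 1]
    "maxabs": MaxAbsScaler(),        # divides by max |x|, keeps zeros at zero
    "quantile": QuantileTransformer(output_distribution="normal", n_quantiles=100),
}

for name, scaler in scalers.items():
    Xt = scaler.fit_transform(X)
    zero_mapped = scaler.transform([[0.0]])[0, 0]
    print(f"{name:>9}: min={Xt.min():.2f} max={Xt.max():.2f} zero maps to {zero_mapped:.2f}")
```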
When to use MinMaxScaler, RobustScaler, StandardScaler, and Normalizer
https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02
StandardScaler
StandardScaler assumes the data is roughly normally distributed and scales each feature to zero mean and unit standard deviation. Use StandardScaler() if you know the data distribution is normal. For most cases, StandardScaler does no harm. It matters most for variance-sensitive methods (PCA, clustering, logistic regression, SVMs, perceptrons, neural networks). On the other hand, it will not make much of a difference if you are using tree-based classifiers or regressors.
MinMaxScaler
MinMaxScaler will transform each value in the column proportionally into the range [0, 1]. This is quite acceptable in cases where we are not concerned with standardisation along the variance axes, e.g. image processing or neural networks expecting values between 0 and 1.
Guide to Scaling and Standardizing
Compare the effect of different scalers on data with outliers

How does VectorSlicer work in Spark 2.0?

In the official Spark documentation:
VectorSlicer is a transformer that takes a feature vector and outputs a new feature vector with a sub-array of the original features. It is useful for extracting features from a vector column.
Does this select the important features from the set of features?
If that is the case how is it done without the mention of a dependent variable?
I am trying to perform data clustering and I need the important features which will contribute to the clusters better. Can I use VectorSlicer for this?
Does this select the important features from the set of features?
It doesn't. It literally slices the vector to select only specified indices.
and need the important features which will contribute to the clusters better.
If you have categorical data consider using ChiSqSelector.
Otherwise you can use dimensionality reduction like PCA. It won't be the same as feature selection but should provide similar benefits (keep only the most important signals, discard the rest).
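A short PySpark sketch (toy vectors of my own construction) showing that VectorSlicer is purely index-based, with no target column involved:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Hypothetical feature vectors; VectorSlicer just keeps the requested indices
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.5, 7.0, -2.0]),), (Vectors.dense([3.0, 0.0, 5.0, 1.0]),)],
    ["features"],
)

slicer = VectorSlicer(inputCol="features", outputCol="sliced", indices=[1, 3])
slicer.transform(df).show(truncate=False)
# The slice is purely positional, so it does not "select important features" by itself
```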

RandomForestClassifier vs ExtraTreesClassifier in scikit learn

Can anyone explain the difference between the RandomForestClassifier and ExtraTreesClassifier in scikit-learn? I've spent a good bit of time reading the paper:
P. Geurts, D. Ernst., and L. Wehenkel, “Extremely randomized trees”, Machine Learning, 63(1), 3-42, 2006
It seems these are the differences for ET:
1) When choosing variables at a split, samples are drawn from the entire training set instead of a bootstrap sample of the training set.
2) Splits are chosen completely at random from the range of values in the sample at each split.
The result of these two things is many more "leaves".
Yes, both conclusions are correct, although the Random Forest implementation in scikit-learn makes it possible to enable or disable bootstrap resampling.
In practice, RFs are often more compact than ETs. ETs are generally cheaper to train from a computational point of view but can grow much bigger. ETs can sometimes generalize better than RFs, but it's hard to guess when that's the case without trying both first (and tuning n_estimators, max_features and min_samples_split by cross-validated grid search).
The ExtraTrees classifier always tests random splits over a fraction of the features (in contrast to RandomForest, which tests all possible splits over a fraction of the features).
The main difference between random forests and extra trees (formally, extremely randomized trees) is that, instead of computing the locally optimal feature/split combination (as the random forest does), a random value is selected as the split threshold for each feature under consideration (as the extra trees do). Here is a good resource with more detail on their differences: Random forest vs extra tree.
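A side-by-side sketch in scikit-learn (synthetic data, untuned hyperparameters, so the scores mean little by themselves) that makes the bootstrap/random-split distinction concrete:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical dataset; in practice tune n_estimators, max_features,
# min_samples_split for both models before comparing them
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# RF: bootstrap samples + locally optimal split thresholds (bootstrap=True by default)
rf = RandomForestClassifier(n_estimators=200, random_state=0)

# ET: whole training set + random split thresholds (bootstrap=False by default)
et = ExtraTreesClassifier(n_estimators=200, bootstrap=False, random_state=0)

print("RF:", cross_val_score(rf, X, y, cv=5).mean())
print("ET:", cross_val_score(et, X, y, cv=5).mean())
```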
