How to shrink a bag-of-words model? - scikit-learn

The question title says it all: How can I make a bag-of-words model smaller? I use a Random Forest and a bag-of-words feature set. My model reaches 30 GB in size and I am sure that most words in the feature set do not contribute to the overall performance.
How to shrink a big bag-of-words model without losing (too much) performance?

Use feature selection. Feature selection removes features from your dataset based on their distribution with regard to your labels, using some scoring function.
Features that rarely occur, or occur randomly with all your labels, for example, are very unlikely to contribute to accurate classification, and get low scores.
Here's an example using sklearn:
from sklearn.feature_selection import SelectPercentile
# Assume some matrix X and labels y
# 10 means only include the 10% best features
selector = SelectPercentile(percentile=10)
# A feature space with only 10% of the features
X_new = selector.fit_transform(X, y)
# See the scores for all features
selector.scores_
As always, be sure to call fit_transform only on your training data. On dev or test data, use only transform. See the scikit-learn feature selection documentation for more details.
Note that there is also a SelectKBest, which does the same, but which allows you to specify an absolute number of features to keep, instead of a percentage.
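To make the SelectKBest variant concrete, here is a minimal sketch on a tiny made-up corpus. The texts, the labels and the choice of chi2 as scoring function are my own assumptions for illustration; use whatever scoring function fits your data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy corpus and labels, for illustration only
train_texts = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "monday project meeting notes",
]
y_train = [1, 1, 0, 0]
test_texts = ["buy pills monday"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Keep the 5 terms with the highest chi-squared score
selector = SelectKBest(chi2, k=5)
X_train_small = selector.fit_transform(X_train, y_train)

# On unseen data, only transform with the already fitted objects
X_test_small = selector.transform(vectorizer.transform(test_texts))

# Map the kept column indices back to terms
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}
kept_terms = [index_to_term[i] for i in selector.get_support(indices=True)]
print(kept_terms)
```

The model is then trained on X_train_small, so both the feature matrix and anything the model stores per feature shrink accordingly.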

If you don't want to change the architecture of your model and are only trying to reduce its memory footprint, one tweak is to reduce the number of terms that the CountVectorizer keeps in its vocabulary.
From the scikit-learn documentation, there are (at least) three parameters for reducing the vocabulary size.
max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
As a first step, try tuning max_df and min_df. If the size still does not meet your requirements, you can shrink the vocabulary as far as you like with max_features.
Be aware that tuning max_features can reduce your classification accuracy more than the other two parameters.
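A small sketch of how the three parameters affect the vocabulary size. The documents and the threshold values are made up for illustration; sensible values depend entirely on your corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the bird flew over the mat",
]

# No pruning: every distinct token becomes a feature
full = CountVectorizer().fit(docs)

# Drop terms found in fewer than 2 documents or in more than 90% of documents
pruned = CountVectorizer(min_df=2, max_df=0.9).fit(docs)

# Hard cap: keep only the 3 most frequent terms in the corpus
capped = CountVectorizer(max_features=3).fit(docs)

print(len(full.vocabulary_), len(pruned.vocabulary_), len(capped.vocabulary_))
print(sorted(pruned.vocabulary_))
```

Here min_df removes the words that occur in a single document and max_df removes "the", which occurs in all of them.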


why is sklearn.feature_selection.RFECV giving different results for each run

I tried to do feature selection with RFECV, but it gives different results on each run. Does cross-validation divide the sample X into random chunks or into sequential deterministic chunks?
Also, why is the score different for grid_scores_ and score(X,y)? Why are the scores sometimes negative?
Does cross-validation divide the sample X into random chunks or into sequential deterministic chunks?
CV divides the data into deterministic chunks by default. You can change this behaviour by setting the shuffle parameter to True.
However, RFECV uses sklearn.model_selection.StratifiedKFold if the y is binary or multiclass.
This means that it will split the data such that each fold has the same (or nearly the same) ratio of classes. In order to do this, the exact data in each fold can change slightly in different iterations of CV. However, this should not cause major changes in the data.
If you are passing a CV iterator using the cv parameter, then you can fix the splits by specifying a random state. The random state is linked to random decisions made by the algorithm. Using the same random state each time will ensure the same behaviour.
Also, why is the score different for grid_scores_ and score(X,y)?
grid_scores_ is an array of cross-validation scores, where grid_scores_[i] is the cross-validation score for the i-th subset of features. Consecutive entries differ by the number of features given by the step parameter, which is 1 by default. As far as I can tell from the scikit-learn implementation, the entries are stored in order of increasing subset size, so the last score is the one for all features. Note that grid_scores_ was deprecated in scikit-learn 1.0 and removed in 1.2 in favour of cv_results_.
score(X, y) reduces X to the selected (optimal) features and returns the score of the fitted estimator on exactly the data you pass in, so it is not a cross-validated score.
why are the scores sometimes negative?
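This depends on the estimator and scorer you are using. If you have not set a scorer, RFECV uses the default score function of the estimator. For classifiers this is generally accuracy, but in your particular case it might be something that can return a negative value, for example R² for a regressor, which is negative when the model does worse than always predicting the mean.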
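If you want repeatable results, one option is to fix every source of randomness yourself. The following is only a sketch on synthetic data; the dataset, the decision tree estimator and the seed values are assumptions for illustration, not part of the question.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, generated with a fixed seed
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

def run_rfecv():
    # Seed both the fold assignment and the estimator
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    estimator = DecisionTreeClassifier(random_state=0)
    selector = RFECV(estimator, step=1, cv=cv, scoring="accuracy")
    return selector.fit(X, y)

first = run_rfecv()
second = run_rfecv()

print(first.n_features_, first.support_)
print((first.support_ == second.support_).all())
```

If the estimator itself is randomized (trees, forests, SGD-based models), seeding only the CV splitter is not enough, which in my experience is the most common reason for run-to-run differences.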

Scikit-Learn Vectorizer `max_features`

How do I choose the number of the max_features parameter in TfidfVectorizer module? Should I use the maximum number of elements in the data?
The description of the parameter does not give me a clear vision of how to choose the value for it:
max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
This parameter is entirely optional and should be chosen based on your own judgement and the structure of your data.
Sometimes it is not effective to use the whole vocabulary, as the data may contain some exceptionally rare words which, if passed to TfidfVectorizer().fit(), will add unwanted dimensions to your inputs. One appropriate technique in this case is to print out the word frequencies across documents and then set a threshold for them. Imagine you have set a threshold of 50 and your corpus has a vocabulary of 100 words. After looking at the word frequencies, you find that 20 words occur fewer than 50 times. Thus, you set max_features=80 and you are good to go.
If max_features is set to None, the whole vocabulary is used during the TF-IDF transformation. Otherwise, if you pass, say, 5 to max_features, the feature matrix is built from the 5 most frequent words across the text documents.
Quick example
Assume you work with hardware-related documents. Your raw data is the following:
from sklearn.feature_extraction.text import TfidfVectorizer
data = ['gpu processor cpu performance',
'gpu performance ram computer',
'cpu computer ram processor jeans']
You see that the word jeans in the third document is hardly related and occurs only once in the whole dataset. The best way to omit the word would, of course, be to use the stop_words parameter, but imagine there are plenty of such words, or words that are related to the topic but occur rarely. In the second case, the max_features parameter might help. If you proceed with max_features=None, it will create a 3x7 sparse matrix, while the best-case scenario would be a 3x6 matrix:
tf = TfidfVectorizer(max_features=None).fit(data)
len(tf.vocabulary_) # returns 7, as the corpus contains 7 unique words
tf.fit_transform(data) # returns a 3x7 sparse matrix
tf = TfidfVectorizer(max_features=6).fit(data) # excluding 'jeans'
tf.vocabulary_ # contains every word except 'jeans'
len(tf.vocabulary_) # returns 6
tf.fit_transform(data) # returns a 3x6 sparse matrix
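The thresholding approach described earlier (count how often each word occurs, then derive max_features from a cut-off) could be sketched like this on the same toy data. The threshold of 2 is an arbitrary assumption chosen so that it fits this tiny corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

data = ['gpu processor cpu performance',
        'gpu performance ram computer',
        'cpu computer ram processor jeans']

# Total number of occurrences of every word in the corpus
counter = CountVectorizer()
counts = counter.fit_transform(data)
totals = np.asarray(counts.sum(axis=0)).ravel()

# Hypothetical cut-off: keep words that occur at least twice
threshold = 2
n_keep = int((totals >= threshold).sum())

tf = TfidfVectorizer(max_features=n_keep).fit(data)
print(n_keep, sorted(tf.vocabulary_))
```

On a real corpus you would look at the distribution of totals first and only then pick the threshold.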

Quantifying Text Keywords for Neural Network Analysis

I am working on a small research project. I am looking to write a program that
a) Takes a large number of short texts (~100 words / several thousand texts)
b) Identify keywords in the texts
c) Presents all of them to a group of users who indicate if they found them interesting or not
d) Have the software learn what keywords or combinations are likely to be preferable. Let's assume that the target group is uniform for this example.
Now, there are two main challenges. The first one I have an answer to, the second one I am looking for help with.
1) Keyword identification.
Reverse frequency analysis seems to be the way to go here. Identify those words that occur proportionally often in a given text when compared to all others. This has some drawbacks though as for example very common keywords may be overlooked.
2) How to prepare the data-set to be numeric. I could map keywords to input neurons and then adjust the value based on their relative frequency, but that limits the model and makes it hard to add new keywords. It also quickly becomes computationally expensive if we want to scale beyond a few dozen keywords.
How would this problem commonly be addressed?
This is a way to start with:
clean your input text (remove special tokens etc)
use n-grams as features (can just start with 1-gram).
treat the user's feedback ("preferable or not") as a binary label.
learn a binary classifier (any model is fine, e.g. naive Bayes or logistic regression).
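The steps above can be sketched with scikit-learn as follows. The texts and labels are invented for illustration, and TF-IDF weighting plus logistic regression is just one reasonable combination, not the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical texts with user feedback: 1 = interesting, 0 = not interesting
texts = [
    "robot builds tiny houses",
    "robot paints tiny murals",
    "tax form deadline extended",
    "tax office deadline moved",
]
labels = [1, 1, 0, 0]

# Unigrams and bigrams as features, followed by a binary classifier
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(texts, labels)

predictions = model.predict(["robot paints houses", "tax deadline"])
print(predictions)

# One learned weight per n-gram; the sign shows which label it pushes towards
weights = model.named_steps["logisticregression"].coef_[0]
vocabulary = model.named_steps["tfidfvectorizer"].vocabulary_
```

Because the vectorizer lives inside the pipeline, the same cleanup and n-gram extraction is applied automatically at prediction time.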
1) Keyword identification. Reverse frequency analysis seems to be the way to go here. Identify those words that occur proportionally often in a given text when compared to all others. This has some drawbacks though as for example very common keywords may be overlooked.
You can skip this part in the first model you build. Treat the sentence as a bag of words (n-grams) to simplify the first working model. If you want, you can add this as a feature weight later.
2) How to prepare the data-set to be numeric. I could map keywords to input neurons and then adjust the value based on their relative frequency, but that limits the model and makes it hard to add new keywords. It also quickly becomes computationally expensive if we want to scale beyond a few dozen keywords.
You can just use a dictionary mapping n-grams to integer ids. The features of each training example are sparse, so your training examples look like this:
34, 68, 79293, 23232 -> 0 (negative label)
340, 608, 3, 232 -> 1 (positive label)
Imagine you have a dictionary (or vocabulary) mapping:
3: foo
34: movie
68: in-stock
232: bar
340: barz
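A minimal pure-Python sketch of such a dictionary. The helper names build_vocab and encode are hypothetical, and whitespace splitting stands in for whatever tokenization and n-gram generation you actually use.

```python
def build_vocab(texts):
    """Assign a new integer id to every token the first time it is seen."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Return the sorted ids of the known tokens in a text (sparse features)."""
    return sorted({vocab[t] for t in text.lower().split() if t in vocab})


# Hypothetical training texts
vocab = build_vocab(["great movie in-stock", "bad movie"])
print(vocab)
print(encode("movie bad unknown", vocab))
```

Tokens that were never seen during training are simply dropped here; a dedicated "unknown" id is another common choice.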
To use neural networks, you will need an embedding layer that turns the sparse features into dense features by aggregating (for instance, averaging) the embedding vectors of all features.
Using the same example as above, suppose we use just 4-dimensional embeddings:
34 -> [0.1, 0.2, -0.3, 0]
68 -> [0, 0.1, -0.1, 0.2]
79293 -> [0.3, 0.0, 0.12, 0]
23232 -> [0.4, 0.0, 0.0, 0]
------------------------------- sum
sum -> [0.8, 0.3, -0.28, 0.2]
------------------------------- L1-normalize
l1 -> [0.8, 0.3, -0.28, 0.2] ./ (0.8 + 0.3 + 0.28 + 0.2)
-> [0.51,0.19,-0.18,0.13]
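The same arithmetic in NumPy, as a quick check of the numbers above. The embedding table is hard-coded here only for illustration; in a real network these vectors are learned parameters.

```python
import numpy as np

# Embedding vectors from the example above (normally learned by the network)
embedding = {
    34: np.array([0.1, 0.2, -0.3, 0.0]),
    68: np.array([0.0, 0.1, -0.1, 0.2]),
    79293: np.array([0.3, 0.0, 0.12, 0.0]),
    23232: np.array([0.4, 0.0, 0.0, 0.0]),
}

feature_ids = [34, 68, 79293, 23232]

# Aggregate the sparse features into one dense vector
summed = np.sum([embedding[i] for i in feature_ids], axis=0)

# L1-normalize: divide by the sum of absolute values
normalized = summed / np.abs(summed).sum()

print(summed)
print(np.round(normalized, 2))
```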
At prediction time, you will need to use the dictionary and the same way of feature extraction (cleanup/n-gram generation/mapping n-gram to ids) so that your model understands the input.
You can simply use sklearn to learn a TF-IDF bag-of-words model of your texts, which returns a sparse matrix of shape n_samples x n_features, like this:
from sklearn.feature_extraction.text import TfidfVectorizer
# TfidfVectorizer accepts raw texts; TfidfTransformer expects a matrix of term counts
vectorizer = TfidfVectorizer(smooth_idf=False)
X_train = vectorizer.fit_transform(list_of_texts)
X_train is a SciPy CSR sparse matrix. If your NN implementation doesn't support sparse matrices, you can convert it to a dense NumPy array, but that might fill your RAM; it is better to use an implementation that supports sparse input (e.g. I know Lasagne/Theano does).
After training, you can use the parameters of the NN to find out which features have a high/low weight and so are more/less important for the particular label.

Spark Naive Bayes Result accuracy (Spark ML 1.6.0) [duplicate]

I am using Spark ML to optimise a Naive Bayes multi-class classifier.
I have about 300 categories and I am classifying text documents.
The training set is balanced enough and there are about 300 training examples for each category.
All looks good and the classifier is working with acceptable precision on unseen documents. But what I am noticing is that when classifying a new document, very often, the classifier assigns a high probability to one of the categories (the prediction probability is almost equal to 1), while the other categories receive very low probabilities (close to zero).
What are the possible reasons for this?
I would like to add that in Spark ML there is something called "raw prediction" and when I look at it, I can see negative numbers, but they have more or less comparable magnitude, so even the category with the high probability has a comparable raw prediction score. I am having difficulty interpreting these scores.
Let's start with a very informal description of the Naive Bayes classifier. If C is the set of all classes, d is a document and x_i are its features, Naive Bayes returns:
$$\hat{c} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)}$$
Since P(d) is the same for all classes, we can simplify this to:
$$\hat{c} = \arg\max_{c \in C} P(d \mid c)\, P(c)$$
Since we assume that the features are conditionally independent (that is why it is naive), we can further simplify this (with Laplace correction to avoid zeros) to:
$$\hat{c} = \arg\max_{c \in C} P(c) \prod_{i} P(x_i \mid c)$$
The problem with this expression is that in any non-trivial case it is numerically equal to zero. To avoid this, we use the following property:
$$\log(a \cdot b) = \log a + \log b$$
and replace the initial expression with:
$$\hat{c} = \arg\max_{c \in C} \left( \log P(c) + \sum_{i} \log P(x_i \mid c) \right)$$
The per-class values inside the arg max are what you get as the raw predictions. Since each term is negative (the logarithm of a value in (0, 1]), the whole expression is negative as well. As you discovered yourself, these values are further normalized so that the maximum value is equal to 1 and then divided by the sum of the normalized values.
It is important to note that while the values you get are not strictly P(c|d), they preserve all the important properties. The order and ratios are exactly the same (ignoring possible numerical issues). If only one class gets a prediction close to one, it means that, given the evidence, it is a very strong prediction. So it is actually something you want to see.
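To see why raw scores of comparable magnitude still produce a probability close to 1, here is a small NumPy sketch of the normalization described above. It is my own illustration, not Spark's code, and the raw values are made up.

```python
import numpy as np

def raw_to_probability(raw):
    """Turn per-class log scores into probabilities that sum to 1."""
    raw = np.asarray(raw, dtype=float)
    # Shift so the largest score becomes 0; exp() of it is then exactly 1
    shifted = raw - raw.max()
    unnormalized = np.exp(shifted)
    return unnormalized / unnormalized.sum()

# Hypothetical raw predictions for three classes: similar magnitude,
# but the differences live on a log scale
raw = [-1520.0, -1535.0, -1540.0]
probs = raw_to_probability(raw)
print(probs)
```

A gap of 15 in log space is a factor of exp(15), roughly 3.3 million, in probability space, which is why the winning class ends up with almost all of the mass. With hundreds of features summed per document, gaps of this size are easy to reach.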

SVM integer features

I'm using the SVM classifier in the machine learning scikit-learn package for python.
My features are integers. When I call the fit function, I get the user warning "Scaler assumes floating point values as input, got int32", the SVM returns its prediction, I calculate the confusion matrix (I have 2 classes) and the prediction accuracy.
I've tried to avoid the user warning, so I saved the features as floats. Indeed, the warning disappeared, but I got a completely different confusion matrix and prediction accuracy (surprisingly much less accurate)
Does someone know why it happens? What is preferable, should I send the features as float or integers?
You should convert them to floats, but the way to do it depends on what the integer features actually represent.
What is the meaning of your integers? Are they category membership indicators (for instance: 1 == sport, 2 == business, 3 == media, 4 == people...) or numerical measures with an order relationship (3 is larger than 2, which in turn is larger than 1)? You cannot say that "people" is larger than "media", for instance. It is meaningless, and giving the machine learning algorithm this assumption would confuse it.
Categorical features should hence be transformed by exploding each feature into several boolean features (with value 0.0 or 1.0), one for each possible category. Have a look at the DictVectorizer class in scikit-learn to better understand what I mean by categorical features.
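A short sketch of what DictVectorizer does with a categorical feature. The feature name "section" and the records are invented for illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical samples with one categorical feature each
records = [
    {"section": "sport"},
    {"section": "business"},
    {"section": "media"},
    {"section": "sport"},
]

# sparse=False returns a dense array, which is easier to read in a small demo
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)

print(vec.vocabulary_)
print(X)
```

Each category gets its own 0.0/1.0 column, so no artificial ordering between "sport", "business" and "media" is implied.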
If they are numerical values, just convert them to floats and maybe use the Scaler to bring them loosely into the range [-1, 1]. If they span several orders of magnitude (e.g. counts of word occurrences), then taking the logarithm of the counts might yield better results. More documentation on feature preprocessing, with examples, can be found in the preprocessing section of the scikit-learn documentation.
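For numerical features, the conversion and scaling could look like the sketch below. The data is made up, and I am using StandardScaler, which I believe is the current name of the Scaler class mentioned in the warning. Note that it standardizes to zero mean and unit variance rather than clipping to a fixed range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical integer features; the second column spans orders of magnitude
X_int = np.array([[1, 200], [3, 4000], [5, 60000]], dtype=np.int32)

# Explicit conversion to float avoids the dtype warning
X_float = X_int.astype(np.float64)

# Compress the wide-ranging column with log(1 + x) before scaling
X_float[:, 1] = np.log1p(X_float[:, 1])

# Fit the scaler on training data only, then reuse it for test data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_float)
print(X_scaled)
```

Whichever preprocessing you pick, apply exactly the same fitted scaler to the data you predict on. Otherwise the SVM sees features on a different scale than the one it was trained on, which can produce very different predictions.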
Edit: also read this guide that has many more details for features representation and preprocessing:
