Scikit-Learn Vectorizer `max_features`

Scikit-Learn Vectorizer `max_features` - scikit-learn

How do I choose the number of the max_features parameter in TfidfVectorizer module? Should I use the maximum number of elements in the data?
The description of the parameter does not give me a clear vision of how to choose the value for it:
max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.

This parameter is absolutely optional and should be calibrated according to the rational thinking and the data structure.
Sometimes it is not effective to transform the whole vocabulary, as the data may have some exceptionally rare words, which, if passed to TfidfVectorizer().fit(), will add unwanted dimensions to inputs in the future. One of the appropriate techniques in this case, for instance, would be to print out word frequences accross documents and then set a certain threshold for them. Imagine you have set a threshold of 50, and your data corpus consists of 100 words. After looking at the word frequences 20 words occur less than 50 times. Thus, you set max_features=80 and you are good to go.
If max_features is set to None, then the whole corpus is considered during the TF-IDF transformation. Otherwise, if you pass, say, 5 to max_features, that would mean creating a feature matrix out of the most 5 frequent words accross text documents.
Quick example
Assume you work with hardware-related documents. Your raw data is the following:
from sklearn.feature_extraction.text import TfidfVectorizer
data = ['gpu processor cpu performance',
'gpu performance ram computer',
'cpu computer ram processor jeans']
You see the word jeans in the third document is hardly related and occures only once in the whole dataset. The best way to omit the word, of course, would be to use stop_words parameter, but imagine if there are plenty of such words; or words that are related to the topic but occur scarcely. In the second case, the max_features parameter might help. If you proceed with max_features=None, then it will create a 3x7 sparse matrix, while the best-case scenario would be 3x6 matrix:
tf = TfidfVectorizer(max_features=None).fit(data)
tf.vocabulary_.__len__() # returns 7 as we passed 7 words
tf.fit_transform(data) # returns 3x7 sparse matrix
tf = TfidfVectorizer(max_features=6).fit(data) # excluding 'jeans'
tf.vocabulary_ # prints out every words except 'jeans'
tf.vocabulary_.__len__() # returns 6
tf.fit_transform(data) # returns 3x6 sparse matrix

Related

How to find the vocabulary size of a spaCy model?

I am trying to find the vocabulary size of the large English model, i.e. en_core_web_lg, and I find three different sources of information:
spaCy's docs: 685k keys, 685k unique vectors
nlp.vocab.__len__(): 1340242 # (number of lexemes)
len(vocab.strings): 1476045
What is the difference between the three? I have not been able to find the answer in the docs.

The most useful numbers are the ones related to word vectors. nlp.vocab.vectors.n_keys tells you how many tokens have word vectors and len(nlp.vocab.vectors) tells you how many unique word vectors there are (multiple tokens can refer to the same word vector in md models).
len(vocab) is the number of cached lexemes. In md and lg models most of those 1340242 lexemes have some precalculated features (like Token.prob) but there can be additional lexemes in this cache without precalculated features since more entries can be added as you process texts.
len(vocab.strings) is the number of strings related to both tokens and annotations (like nsubj or NOUN), so it's not a particularly useful number. All strings used anywhere in training or processing are stored here so that the internal integer hashes can be converted back to strings when needed.

Since spaCy 2.3+, according to the release notes, the lexemes are not loaded in nlp.vocab; so using len(nlp.vocab) is not effective. Instead, use nlp.meta['vectors'] to find the number of unique vectors and words. Here is the relevant section from release notes:
To reduce the initial loading time, the lexemes in nlp.vocab are no
longer loaded on initialization for models with vectors. As you
process texts, the lexemes will be added to the vocab automatically,
just as in small models without vectors.
To see the number of unique vectors and number of words with vectors,
see nlp.meta['vectors'], for example for en_core_web_md there are
20000 unique vectors and 684830 words with vectors:
{
'width': 300,
'vectors': 20000,
'keys': 684830,
'name': 'en_core_web_md.vectors'
}

How to shrink a bag-of-words model?

The question title says it all: How can I make a bag-of-words model smaller? I use a Random Forest and a bag-of-words feature set. My model reaches 30 GB in size and I am sure that most words in the feature set do not contribute to the overall performance.
How to shrink a big bag-of-words model without losing (too much) performance?

Use feature selection. Feature selection removes features from your dataset based on their distribution with regards to your labels, using some scoring function.
Features that rarely occur, or occur randomly with all your labels, for example, are very unlikely to contribute to accurate classification, and get low scores.
Here's an example using sklearn:
from sklearn.feature_selection import SelectPercentile
# Assume some matrix X and labels y
# 10 means only include the 10% best features
selector = SelectPercentile(percentile=10)
# A feature space with only 10% of the features
X_new = selector.fit_transform(X, y)
# See the scores for all features
selector.scores
As always, be sure to only call fit_transform on your training data. When using dev or test data, only use transform. See here for additional documentation.
Note that there is also a SelectKBest, which does the same, but which allows you to specify an absolute number of features to keep, instead of a percentage.

If you don't want to change the architecture of your neural network and you are only trying to reduce the memory footprint, a tweak that can be made is to reduce the terms annotated by the CountVectorizer.
From the scikit-learn documentation, we have (at least) three parameter for reduce the vocabulary size.
max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
In first instance, try to play with max_df and min_df. If the size is still not suitable with your requirements, you can drop the size as you like using the max_features.
NOTE:
The max_features tuning can drop your classification accuracy by an higher ratio than the other parameters

reducing word2vec dimension from Google News Vector Dataset

I loaded google's news vector -300 dataset. Each word is represented with a 300 point vector. I want to use this in my neural network for classification. But 300 for one word seems to be too big. How can i reduce the vector from 300 to say 100 without compromising on the quality.

tl;dr Use a dimensionality reduction technique like PCA or t-SNE.
This is not a trivial operation that you are attempting. In order to understand why, you must understand what these word vectors are.
Word embeddings are vectors that attempt to encode information about what a word means, how it can be used, and more. What makes them interesting is that they manage to store all of this information as a collection of floating point numbers, which is nice for interacting with models that process words. Rather than pass a word to a model by itself, without any indication of what it means, how to use it, etc, we can pass the model a word vector with the intention of providing extra information about how natural language works.
As I hope I have made clear, word embeddings are pretty neat. Constructing them is an area of active research, though there are a couple of ways to do it that produce interesting results. It's not incredibly important to this question to understand all of the different ways, though I suggest you check them out. Instead, what you really need to know is that each of the values in the 300 dimensional vector associated with a word were "optimized" in some sense to capture a different aspect of the meaning and use of that word. Put another way, each of the 300 values corresponds to some abstract feature of the word. Removing any combination of these values at random will yield a vector that may be lacking significant information about the word, and may no longer serve as a good representation of that word.
So, picking the top 100 values of the vector is no good. We need a more principled way to reduce the dimensionality. What you really want is to sample a subset of these values such that as much information as possible about the word is retained in the resulting vector. This is where a dimensionality reduction technique like Principle Component Analysis (PCA) or t-distributed Stochastic Neighbor Embeddings (t-SNE) come into play. I won't describe in detail how these methods work, but essentially they aim to capture the essence of a collection of information while reducing the size of the vector describing said information. As an example, PCA does this by constructing a new vector from the old one, where the entries in the new vector correspond to combinations of the main "components" of the old vector, i.e those components which account for most of the variety in the old data.
To summarize, you should run a dimensionality reduction algorithm like PCA or t-SNE on your word vectors. There are a number of python libraries that implement both (e.g scipy has a PCA algorithm). Be warned, however, that the dimensionality of these word vectors is already relatively low. To see how this is true, consider the task of naively representing a word via its one-hot encoding (a one at one spot and zeros everywhere else). If your vocabulary size is as big as the google word2vec model, then each word is suddenly associated with a vector containing hundreds of thousands of entries! As you can see, the dimensionality has already been reduced significantly to 300, and any reduction that makes the vectors significantly smaller is likely to lose a good deal of information.

#narasimman I suggest that you simply keep the top 100 numbers in the output vector of the word2vec model. The output is of type numpy.ndarray so you can do something like:
>>> word_vectors = KeyedVectors.load_word2vec_format('modelConfig/GoogleNews-vectors-negative300.bin', binary=True)
>>> type(word_vectors["hello"])
<type 'numpy.ndarray'>
>>> word_vectors["hello"][:10]
array([-0.05419922, 0.01708984, -0.00527954, 0.33203125, -0.25 ,
-0.01397705, -0.15039062, -0.265625 , 0.01647949, 0.3828125 ], dtype=float32)
>>> word_vectors["hello"][:2]
array([-0.05419922, 0.01708984], dtype=float32)
I don't think that this will screw up the result if you do it to all the words (not sure though!)

what is dimensionality in word embeddings?

I want to understand what is meant by "dimensionality" in word embeddings.
When I embed a word in the form of a matrix for NLP tasks, what role does dimensionality play? Is there a visual example which can help me understand this concept?

Answer
A Word Embedding is just a mapping from words to vectors. Dimensionality in word
embeddings refers to the length of these vectors.
Additional Info
These mappings come in different formats. Most pre-trained embeddings are
available as a space-separated text file, where each line contains a word in the
first position, and its vector representation next to it. If you were to split
these lines, you would find out that they are of length 1 + dim, where dim
is the dimensionality of the word vectors, and 1 corresponds to the word being represented. See the GloVe pre-trained
vectors for a real example.
For example, if you download glove.twitter.27B.zip, unzip it, and run the following python code:
#!/usr/bin/python3
with open('glove.twitter.27B.50d.txt') as f:
lines = f.readlines()
lines = [line.rstrip().split() for line in lines]
print(len(lines)) # number of words (aka vocabulary size)
print(len(lines[0])) # length of a line
print(lines[130][0]) # word 130
print(lines[130][1:]) # vector representation of word 130
print(len(lines[130][1:])) # dimensionality of word 130
you would get the output
1193514
51
people
['1.4653', '0.4827', ..., '-0.10117', '0.077996'] # shortened for illustration purposes
50
Somewhat unrelated, but equally important, is that lines in these files are sorted according to the word frequency found in the corpus in which the embeddings were trained (most frequent words first).
You could also represent these embeddings as a dictionary where
the keys are the words and the values are lists representing word vectors. The length
of these lists would be the dimensionality of your word vectors.
A more common practice is to represent them as matrices (also called lookup
tables), of dimension (V x D), where V is the vocabulary size (i.e., how
many words you have), and D is the dimensionality of each word vector. In
this case you need to keep a separate dictionary mapping each word to its
corresponding row in the matrix.
Background
Regarding your question about the role dimensionality plays, you'll need some theoretical background. But in a few words, the space in which words are embedded presents nice properties that allow NLP systems to perform better. One of these properties is that words that have similar meaning are spatially close to each other, that is, have similar vector representations, as measured by a distance metric such as the Euclidean distance or the cosine similarity.
You can visualize a 3D projection of several word embeddings here, and see, for example, that the closest words to "roads" are "highways", "road", and "routes" in the Word2Vec 10K embedding.
For a more detailed explanation I recommend reading the section "Word Embeddings" of this post by Christopher Olah.
For more theory on why using word embeddings, which are an instance of distributed representations, is better than using, for example, one-hot encodings (local representations), I recommend reading the first sections of Distributed Representations by Geoffrey Hinton et al.

Word embeddings like word2vec or GloVe don't embed words in two-dimensional matrices, they use one-dimensional vectors. "Dimensionality" refers to the size of these vectors. It is separate from the size of the vocabulary, which is the number of words you actually keep vectors for instead of just throwing out.
In theory larger vectors can store more information since they have more possible states. In practice there's not much benefit beyond a size of 300-500, and in some applications even smaller vectors work fine.
Here's a graphic from the GloVe homepage.
The dimensionality of the vectors is shown on the left axis; decreasing it would make the graph shorter, for example. Each column is an individual vector with color at each pixel determined by the number at that position in the vector.

The "dimensionality" in word embeddings represent the total number of features that it encodes. Actually, it is over simplification of the definition, but will come to that bit later.
The selection of features is usually not manual, it is automatic by using hidden layer in the training process. Depending on the corpus of literature the most useful dimensions (features) are selected. For example if the literature is about romantic fictions, the dimension for gender is much more likely to be represented compared to the literature of mathematics.
Once you have the word embedding vector of 100 dimensions (for example) generated by neural network for 100,000 unique words, it is not generally much useful to investigate the purpose of each dimension and try to label each dimension by "feature name". Because the feature(s) that each dimension represents may not be simple and orthogonal and since the process is automatic no body knows exactly what each dimension represents.
For more insight to understand this topic you may find this post useful.

Textual data has to be converted into numeric data before feeding into any Machine Learning algorithm.
Word Embedding is an approach for this where each word is mapped to a vector.
In algebra, A Vector is a point in space with scale & direction.
In simpler term Vector is a 1-Dimensional vertical array ( or say a matrix having single column) and Dimensionality is the number of elements in that 1-D vertical array.
Pre-trained word embedding models like Glove, Word2vec provides multiple dimensional options for each word, for instance 50, 100, 200, 300. Each word represents a point in D dimensionality space and synonyms word are points closer to each other. Higher the dimension better shall be the accuracy but computation needs would also be higher.

I'm not an expert, but I think the dimensions just represent the variables (aka attributes or features) which have been assigned to the words, although there may be more to it than that. The meaning of each dimension and total number of dimensions will be specific to your model.
I recently saw this embedding visualisation from the Tensor Flow library:
https://www.tensorflow.org/get_started/embedding_viz
This particularly helps reduce high-dimensional models down to something human-perceivable. If you have more than three variables it's extremely difficult to visualise the clustering (unless you are Stephen Hawking apparently).
This wikipedia article on dimensional reduction and related pages discuss how features are represented in dimensions, and the problems of having too many.

According to the book Neural Network Methods for Natural Language Processing by Goldenberg, dimensionality in word embeddings (demb) refers to number of columns in first weight matrix (weights between input layer and hidden layer) of embedding algorithms such as word2vec. N in the image is dimensionality in word embedding:
For more information you can refer to this link:
https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

how to perfom classfication

I'm trying to perform document classification into two categories (category1 and category2), using Weka.
I've gathered a training set consisting of 600 documents belonging to both categories and the total number of documents that are going to be classified is 1,000,000.
So to perform the classification, I apply the StringToWordVector filter. I set true the followings from the filter:
- IDF transform
- TF ransform
- OutputWordCounts
I'd like to ask a few questions about this process.
1) How many documents shall I use as training set, so that I over-fitting is avoided?
2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result at the classifier or it doesn't play any role?
3) As classification method I usually choose naiveBayes but the results I get are the followings:
-------------------------
Correctly Classified Instances 393 70.0535 %
Incorrectly Classified Instances 168 29.9465 %
Kappa statistic 0.415
Mean absolute error 0.2943
Root mean squared error 0.5117
Relative absolute error 60.9082 %
Root relative squared error 104.1148 %
----------------------------
and if I use SMO the results are:
------------------------------
Correctly Classified Instances 418 74.5098 %
Incorrectly Classified Instances 143 25.4902 %
Kappa statistic 0.4742
Mean absolute error 0.2549
Root mean squared error 0.5049
Relative absolute error 52.7508 %
Root relative squared error 102.7203 %
Total Number of Instances 561
------------------------------
So in document classification which one is "better" classifier?
Which one is better for small data sets, like the one I have?
I've read that naiveBayes performs better with big data sets but if I increase my data set, will it cause the "over-fitting" effect?
Also, about Kappa statistic, is there any accepted threshold or it doesn't matter in this case because there are only two categories?
Sorry for the long post, but I've been trying for a week to improve the classification results with no success, although I tried to get documents that fit better in each category.

1) How many documents shall I use as training set, so that I
over-fitting is avoided? \
You don't need to choose the size of training set, in WEKA, you just use the 10-fold cross-validation. Back to the question, machine learning algorithms influence much more than data set in over-fitting problem.
2) After applying the filter, I get a list of the words in the
training set. Do I have to remove any of them to get a better result
at the classifier or it doesn't play any role? \
Definitely it does. But whether the result get better can not be promised.
3) As classification method I usually choose naiveBayes but the
results I get are the followings: \
Usually, to define whether a classify algorithm is good or not, the ROC/AUC/F-measure value is always considered as the most important indicator. You can learn them in any machine learning book.

To answers your questions:
I would use (10 fold) cross-validation to evaluate your method. The model is trained trained 10 times on 90% of the data and tested on 10% of the data using different parts of the data each time. The results are therefor less biased towards your current (random) selection of train and test set.
Removing stop words (i.e., frequently occurring words with little discriminating value like the, he or and) is a common strategy to improve your classifier. Weka's StringToWordVector allows you to select a file containing these stop words, but it should also have a default list with English stop words.
Given your results, SMO performs the best of the two classifiers (e.g., it has more Correctly Classified Instances). You might also want to take a look at (Lib)SVM or LibLinear (You may need to install them if they are not in Weka natively; Weka 3.7.6 has a package manager allowing for easy installation), which can perform quite well on document classification as well.

Regarding the second question
2) After applying the filter, I get a list of the words in the training set. Do I have to remove any of them to get a better result at the classifier or it doesn't play any role?
I was building a classifier and training it with the famous 20news group dataset, when testing it without the preprocessing the results were not good. So, i pre-processed the data according to the following steps:
Substitute TAB, NEWLINE and RETURN characters by SPACE.
Keep only letters (that is, turn punctuation, numbers, etc. into SPACES).
Turn all letters to lowercase.
Substitute multiple SPACES by a single SPACE.
The title/subject of each document is simply added in the beginning of the document's text.
no-short Obtained from the previous file, by removing words that are less than 3 characters long. For example, removing "he" but keeping "him".
no-stop Obtained from the previous file, by removing the 524 SMART stopwords. Some of them had already been removed, because they were shorter than 3 characters.
stemmed Obtained from the previous file, by applying Porter's Stemmer to the remaining
words. Information about stemming can be found here.
These steps are taken from http://web.ist.utl.pt/~acardoso/datasets/

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string