How to get feature names while using HashingVectorizer in python? - scikit-learn

I want to make a 2D binary array (n_samples, n_features), where each sample is a text string and each feature is a word(unigram).
The problem is number of sample is 350000 and nunmber of feature is 40000 but my RAM size is 4GB only.
I am getting memory error after using CountVectorizer. So, is there any other way(like mini-batch) to do this?
If I use HashingVectorizer then how to get the feature_names? i.e. which column correspond to which feature?, because get_feature_names() method is not available in HashingVectorizer.

To get feature names for HashingVectorizer you can take a random sample of documents, compute hashes for them and learn which hash correspond to which tokens this way. It is not perfect because there can be other tokens which correspond to a given column, and there can be collisions, but often this is enough to inspect the vectorization result (or e.g. coefficients of a linear classifier which uses hashing features).
A shameless plug - https://github.com/TeamHG-Memex/eli5 package has this implemented:
from eli5.sklearn import InvertableHashingVectorizer
# vec should be a HashingVectorizer instance
ivec = InvertableHashingVectorizer(vec)
ivec.fit(docs_sample) # e.g. each 10-th or 100-th document
names = ivec.get_feature_names()
See also: Debugging Hashing Vectorizer section in eli5 docs.

Mini batches are not supporting in countvectorizer. However, hashing vectorizer of sklearn has partial_fit() that you can use.
Quoting sklearn documentation "There is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model."

Related

get closest vector from unknown vector with gensim

I am currently implementing a natural text generator for a school project. I have a dataset of sentences of predetermined lenght and key words, I convert them in vectors thanks to gensim and GoogleNews-vectors-negative300.bin.gz. I train a recurrent neural network to create a list of vectors that I compare to the list of vectors of the real sentence. So I try to get as close as possible to the "real" vectors.
My problem happens when I have to convert back vectors into words: my vectors aren't necessarily in the google set. So I would like to know if there is an efficient solution to get the closest vector in the Google set to an outpout vector.
I work with python 3 and Tensorflow
Thanks a lot, feel free to ask any questions about the project
Charles
The gensim method .most_similar() (on KeyedVectors & similar classes) will also accept raw vectors as the 'origin' from which to search.
Just be sure to explicitly name the positive parameter - a list of target words/vectors to combine to find the origin point.
For example:
gvecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz')
target_vec = gvecs['apple']
similars = gvecs.most_similar(positive=[target_vec,])

Finding both target and center word2vec matrices

I've read and heard(In the CS224 of Stanford) that the Word2Vec algorithm actually trains two matrices(that is, two sets of vectors.) These two are the U and the V set, one for words being a target and one for words being the context. The final output is the average of these two.
I have two questions in mind. one is that:
Why do we get an average of two vectors? Why it makes sense? Don't we lose some information?
The second question is, using pre-trained word2vec models, how can I get access to both matrices? Is there any downloadable word2vec with both sets of vectors? I don't have enough resources to train a new one.
Thanks
That relayed description isn't quite right. The word-vectors traditionally retrieved from a word2vec model come from a "projection matrix" which converts individual words to a right-sized input-vector for the shallow neural network.
(You could think of the projection matrix as turning a one-hot encoding into a dense-embedding for that word, but libraries typically implement this via a dictionary-lookup – eg: "what row of the vectors-matrix should I consult for this word-token?")
There's another matrix of weights leading to the model's output nodes, whose interpretation varies based on the training mode. In the common default of negative-sampling, there's one node per known word, so you could also interpret this matrix as having a vector per word. (In hierarchical-softmax mode, the known-words aren't encoded as single output nodes, so it's harder to interpret the relationship of this matrix to individual words.)
However, this second vector per word is rarely made directly available by libraries. Most commonly, the word-vector is considered simply the trained-up input vector, from the projection matrix. For example, the export format from Google's original word2vec.c release only saves-out those vectors, and the large "GoogleNews" vector set they released only has those vectors. (There's no averaging with the other output-side representation.)
Some work, especially that of Mitra et all of Microsoft Research (in "Dual Embedding Space Models" & associated writeups) has noted those output-side vectors may be of value in some applications as well – but I haven't seen much other work using those vectors. (And, even in that work, they're not averaged with the traditional vectors, but consulted as a separate option for some purposes.)
You'd have to look at the code of whichever libraries you're using to see if you can fetch these from their full post-training model representation. In the Python gensim library, this second matrix in the negative-sampling case is a model property named syn1neg, following the naming of the original word2vec.c.

How to resolve memory overloading by passing an iterator to CountVectorizer?

I'm working on extracting text features from a large dataset of documents (about 15 million documents) using CountVectorizer. I also looked at HashingVectorizer as an alternative, but I think CountVectorizer is what I need, as it provides more information about text features and other stuff.
The problem here is kinda common: I don't have enough memory when fitting the CountVectorizer model.
def getTexts():
# an iterator that will yield each document from the database
vectorizer = CountVectorizer(max_features=500, ngram_range=(1,3))
X = vectorizer.fit_transform(getTexts())
Here, let's say I have an iterator that will yield one document at a time from a database. If I pass this iterator as a parameter to CountVectorizer fit() function, how is the vocabulary built? Does it wait until finishing loading all the documents and then do the fit() once, or does it load one document at a time, do the fit, and then load the next one? What's a possible solution to resolve the memory overhead here?
The reason why CountVectorizer will consume much more memory is that the CountVectorizer needs to store a vocabulary dictionary in memory, however, the HashingVectorizer has a better memory performance because it does not need to store the vocabulary dictionary. The main difference between these two vectorizers is mentioned in the Doc of HashingVectorizer:
This strategy has several advantages:
it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory
it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
There are also a couple of cons (vs using a CountVectorizer with an
in-memory vocabulary):
there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to
introspect which features are most important to a model.
there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if
n_features is large enough (e.g. 2 ** 18 for text classification
problems).
no IDF weighting as this would render the transformer stateful.
And of course the CountVectorizer will load one document at a time, do the fit, and then load the next one. In this process the CountVectorizer will build its vocabulary dictionary as the memory usage surging.
To optimize the memory, you may need to reduce the size of document dataset, or giving a lower max_features parameter may also help. However if you want to resolve this memory problem completely, try to use the HashingVectorizer instead of the CountVectorizer.

How handle categorical features in the latest Random Forest in Spark?

In the Mllib version of Random Forest there was a possibility to specify the columns with nominal features (numerical but still categorical variables) with parameter categoricalFeaturesInfo
What's about the ML Random Forest? In the user guide there is an example that uses VectorIndexer that converts the categorical features in vector as well, but it's written "Automatically identify categorical features, and index them"
In the other discussion of the same problem I found that numerical indexes are treated as continuous features anyway in random forest, and it's recommended to do one-hot encoding to avoid this, that seems to not make sense in the case of this algorithm, and especially given the official example mentioned above!
I noticed also that when having a lot of categories(>1000) in the categorical column, once they are indexed with StringIndexer, random forest algorithm asks me setting the MaxBin parameter, supposed to be used with continuous features. Does it mean that the features more than number of bins will be treated as continuous, as it's specified in the official example, and so StringIndexer is OK for my categorical column, or does it mean that the whole column with numerical still nominal features will be bucketized with assumption that the variables are continuous?
In the other discussion of the same problem I found that numerical indexes are treated as continuous features anyway in random forest,
This is actually incorrect. Tree models (including RandomForest) depend on column metadata to distinguish between categorical and numerical variables. Metadata can be provided by ML transformers (like StringIndexer or VectorIndexer) or added manually. The old mllib RDD-based API, which is used internally by ml models, uses categoricalFeaturesInfo Map for the same purpose.
Current API just takes the metadata and converts to the format expected by categoricalFeaturesInfo.
OneHotEncoding is required only for linear models, and recommended, although not required, for multinomial naive Bayes classifier.

Possibility to apply online algorithms on big data files with sklearn?

I would like to apply fast online dimensionality reduction techniques such as (online/mini-batch) Dictionary Learning on big text corpora.
My input data naturally do not fit in the memory (this is why i want to use an online algorithm) so i am looking for an implementation that can iterate over a file rather than loading everything in memory.
Is it possible to do this with sklearn ? are there alternatives ?
Thanks
register
For some algorithms supporting partial_fit, it would be possible to write an outer loop in a script to do out-of-core, large scale text classification. However there are some missing elements: a dataset reader that iterates over the data on the disk as folders of flat files or a SQL database server, or NoSQL store or a Solr index with stored fields for instance. We also lack an online text vectorizer.
Here is a sample integration template to explain how it would fit together.
import numpy as np
from sklearn.linear_model import Perceptron
from mymodule import SomeTextDocumentVectorizer
from mymodule import DataSetReader
dataset_reader = DataSetReader('/path/to/raw/data')
expected_classes = dataset_reader.get_all_classes() # need to know the possible classes ahead of time
feature_extractor = SomeTextDocumentVectorizer()
classifier = Perceptron()
dataset_reader = DataSetReader('/path/to/raw/data')
for i, (documents, labels) in enumerate(dataset_reader.iter_chunks()):
vectors = feature_extractor.transform(documents)
classifier.partial_fit(vectors, labels, classes=expected_classes)
if i % 100 == 0:
# dump model to be able to monitor quality and later analyse convergence externally
joblib.dump(classifier, 'model_%04d.pkl' % i)
The dataset reader class is application specific and will probably never make it into scikit-learn (except maybe for a folder of flat text files or CSV files that would not require to add a new dependency to the library).
The text vectorizer part is more problematic. The current vectorizer does not have a partial_fit method because of the way we build the in-memory vocabulary (a python dict that is trimmed depending on max_df and min_df). We could maybe build one using an external store and drop the max_df and min_df features.
Alternatively we could build an HashingTextVectorizer that would use the hashing trick to drop the dictionary requirements. None of those exist at the moment (although we already have some building blocks such as a murmurhash wrapper and a pull request for hashing features).
In the mean time I would advise you to have a look at Vowpal Wabbit and maybe those python bindings.
Edit: The sklearn.feature_extraction.FeatureHasher class has been merged into the master branch of scikit-learn and will be available in the next release (0.13). Have a look at the documentation on feature extraction.
Edit 2: 0.13 is now released with both FeatureHasher and HashingVectorizerthat can directly deal with text data.
Edit 3: there is now an example on out-of-core learning with the Reuters dataset in the official example gallery of the project.
Since Sklearn 0.13 there is indeed an implementation of the HashingVectorizer.
EDIT: Here is a full-fledged example of such an application
Basically, this example demonstrates that you can learn (e.g. classify text) on data that cannot fit in the computer's main memory (but rather on disk / network / ...).
In addition to Vowpal Wabbit, gensim might be interesting as well - it too features online Latent Dirichlet Allocation.

Resources