Possibility to apply online algorithms on big data files with sklearn? - scikit-learn

I would like to apply fast online dimensionality-reduction techniques such as (online/mini-batch) dictionary learning to big text corpora.
My input data naturally do not fit in memory (which is why I want to use an online algorithm), so I am looking for an implementation that can iterate over a file rather than loading everything into memory.
Is it possible to do this with sklearn? Are there alternatives?
Thanks

For some algorithms that support partial_fit, it is possible to write an outer loop in a script to do out-of-core, large-scale text classification. However, some elements are missing: a dataset reader that iterates over the data on disk (stored as folders of flat files, in a SQL database server, in a NoSQL store, or in a Solr index with stored fields, for instance). We also lack an online text vectorizer.
Here is a sample integration template to explain how it would fit together.
import joblib  # used below to persist intermediate model snapshots

from sklearn.linear_model import Perceptron

from mymodule import SomeTextDocumentVectorizer
from mymodule import DataSetReader

dataset_reader = DataSetReader('/path/to/raw/data')

expected_classes = dataset_reader.get_all_classes()  # need to know the possible classes ahead of time
feature_extractor = SomeTextDocumentVectorizer()
classifier = Perceptron()

for i, (documents, labels) in enumerate(dataset_reader.iter_chunks()):

    vectors = feature_extractor.transform(documents)
    classifier.partial_fit(vectors, labels, classes=expected_classes)

    if i % 100 == 0:
        # dump model to be able to monitor quality and later analyse convergence externally
        joblib.dump(classifier, 'model_%04d.pkl' % i)
The dataset reader class is application-specific and will probably never make it into scikit-learn (except maybe for a folder of flat text files or CSV files, which would not require adding a new dependency to the library).
The text vectorizer part is more problematic. The current vectorizer does not have a partial_fit method because of the way we build the in-memory vocabulary (a Python dict that is trimmed depending on max_df and min_df). We could maybe build one using an external store and drop the max_df and min_df features.
Alternatively, we could build a HashingTextVectorizer that would use the hashing trick to drop the dictionary requirement. Neither of those exists at the moment (although we already have some building blocks, such as a murmurhash wrapper and a pull request for hashing features).
In the meantime I would advise you to have a look at Vowpal Wabbit and maybe those Python bindings.
Edit: The sklearn.feature_extraction.FeatureHasher class has been merged into the master branch of scikit-learn and will be available in the next release (0.13). Have a look at the documentation on feature extraction.
Edit 2: 0.13 is now released with both FeatureHasher and HashingVectorizer, which can directly deal with text data.
Edit 3: there is now an example on out-of-core learning with the Reuters dataset in the official example gallery of the project.
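For a quick feel of the FeatureHasher API mentioned in the edits above, here is a minimal sketch (the toy token lists are made up; in practice you would stream tokenized documents from disk):
from sklearn.feature_extraction import FeatureHasher

# stateless hasher: no vocabulary is kept in memory, so it scales to arbitrarily large streams
hasher = FeatureHasher(n_features=2 ** 18, input_type='string')

# with input_type='string', each sample is an iterable of (already tokenized) terms
token_stream = [
    ['online', 'learning', 'with', 'sklearn'],
    ['hashing', 'trick', 'for', 'text'],
]
X = hasher.transform(token_stream)   # scipy.sparse matrix of shape (2, 2**18)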

Since Sklearn 0.13 there is indeed an implementation of the HashingVectorizer.
EDIT: Here is a full-fledged example of such an application
Basically, this example demonstrates that you can learn (e.g. classify text) on data that does not fit in the computer's main memory (but rather lives on disk / the network / ...).
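A minimal sketch of the pattern, assuming a hypothetical read_text_chunks() generator that yields (texts, labels) batches from disk and a known ALL_CLASSES list:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2 ** 20)  # stateless: nothing to fit, nothing kept in memory
classifier = SGDClassifier()

# read_text_chunks() and ALL_CLASSES are placeholders you would provide yourself
for texts, labels in read_text_chunks():
    X = vectorizer.transform(texts)                       # sparse matrix for this chunk only
    classifier.partial_fit(X, labels, classes=ALL_CLASSES)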

In addition to Vowpal Wabbit, gensim might be interesting as well - it too features online Latent Dirichlet Allocation.
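A rough sketch of gensim's streamed usage, assuming a small placeholder reader that yields one tokenized document at a time from disk (two passes over the file: one to build the dictionary, one to train):
from gensim import corpora, models

def iter_tokenized_docs(path):
    # placeholder reader: in reality, stream one tokenized document (a list of words) at a time
    for line in open(path, encoding='utf-8'):
        yield line.split()

# first streaming pass: build the word <-> id mapping
dictionary = corpora.Dictionary(iter_tokenized_docs('/path/to/corpus.txt'))

# second streaming pass: feed bag-of-words vectors to the online LDA updater
bow_stream = (dictionary.doc2bow(doc) for doc in iter_tokenized_docs('/path/to/corpus.txt'))
lda = models.LdaModel(bow_stream, id2word=dictionary,
                      num_topics=100, chunksize=2000, update_every=1)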

Related

Unable to Load FastText model

I am trying to load the FastText model and save it so that I can deploy it to production; the file is 1.2 GB, and it would not be good practice to use it directly in prod.
Can anyone suggest an approach to save and load the model for production ("fasttext-wiki-news-subwords-300")?
I am loading the file using the gensim.downloader API.
You can use the library https://github.com/avidale/compress-fasttext, which is a wrapper around Gensim that can serve compressed versions of unsupervised FastText models.
The compressed versions can be orders of magnitude smaller (e.g. 20mb), with a tolerable loss in quality.
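A rough usage sketch based on that project's README (the class name and download URL are taken from the README and may change between versions):
import compress_fasttext

# load a pre-compressed model published in the project's releases (URL is illustrative)
small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
    'https://github.com/avidale/compress-fasttext/releases/download/v0.0.4/cc.en.300.compressed.bin'
)
print(small_model['landlord'])   # behaves like regular gensim FastText vectors, OOV words included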
In order to have clarity over exactly what you're getting, and in what format, I strongly recommend downloading things like sets of pretrained vectors from their original sources rather than via the Gensim gensim.downloader convenience methods. (That API also, against most users' expectations and best packaging hygiene, will download and run arbitrary other code that's not part of Gensim's version-controlled source repository or its official PyPI package. See project issue #2283.)
For example, you could grab the raw vectors files direct from: https://fasttext.cc/docs/en/english-vectors.html
The tool from david-dale's answer looks interesting for its radical compression, and if you can verify the compressed versions still work well for your purposes, it may be an ideal approach for memory-limited production deployments.
I would also consider:
A production machine with enough RAM to load the full model may not be too costly, and with these sorts of vector models, typical access patterns mean you essentially always want the full model in RAM, with no virtual-memory swapping at all. If your deployment is in a web server, there are some memory-mapping tricks that can let many processes share the same singly-loaded copy of the model (avoiding time- and memory-consuming redundant reloads). See this answer for an approach that works with Word2Vec (though it may need some adaptation for FastText and recent Gensim versions).
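A minimal sketch of that memory-mapping idea using Gensim's native save format (file names are illustrative, and details vary by Gensim version and model type):
from gensim.models import KeyedVectors

# one-time preparation: re-save the vectors in Gensim's native format so the big arrays
# land in separate .npy files that can be memory-mapped
kv = KeyedVectors.load_word2vec_format('wiki-news-300d-1M-subword.vec')
kv.save('fasttext_vectors.kv')

# in each web-server worker process: mmap='r' lets the OS share one read-only copy of the arrays
shared = KeyedVectors.load('fasttext_vectors.kv', mmap='r')
The same save()/load(..., mmap='r') pattern should apply to full FastText model objects saved in Gensim's native format as well.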
If you don't need the FastText-specific subword-based synthesis, you can save just the full-word vectors to a file in a simple format, then choose to reload only a small subset of the leading vectors (the most common words) using the limit option of load_word2vec_format(). For example:
from gensim.models import KeyedVectors

# save only the word-vectors from a FastText model
ft_model.wv.save_word2vec_format('wvonly.txt', binary=False)

# ... then, later/elsewhere:
# load only the 1st 50,000 word-vectors
wordvecs = KeyedVectors.load_word2vec_format('wvonly.txt', binary=False, limit=50000)

How to resolve memory overloading by passing an iterator to CountVectorizer?

I'm working on extracting text features from a large dataset of documents (about 15 million) using CountVectorizer. I also looked at HashingVectorizer as an alternative, but I think CountVectorizer is what I need, as it provides more information about the text features (such as the actual vocabulary).
The problem here is fairly common: I don't have enough memory when fitting the CountVectorizer model.
def getTexts():
    # an iterator that will yield each document from the database
    ...

vectorizer = CountVectorizer(max_features=500, ngram_range=(1, 3))
X = vectorizer.fit_transform(getTexts())
Here, let's say I have an iterator that yields one document at a time from a database. If I pass this iterator to CountVectorizer's fit() function, how is the vocabulary built? Does it wait until all the documents are loaded and then do the fit() once, or does it load one document at a time, fit, and then load the next one? What is a possible solution to resolve the memory overhead here?
The reason CountVectorizer consumes much more memory is that it needs to store a vocabulary dictionary in memory, whereas HashingVectorizer has better memory performance because it does not need to store one. The main difference between the two vectorizers is explained in the documentation of HashingVectorizer:
This strategy has several advantages:
it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory
it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):
there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
no IDF weighting as this would render the transformer stateful.
And of course, CountVectorizer will load one document at a time, fit on it, and then load the next one; in the process it builds its vocabulary dictionary, which is what makes memory usage surge.
To reduce the memory pressure, you can shrink the document dataset, or passing a lower max_features may also help. However, if you want to resolve this memory problem completely, try using HashingVectorizer instead of CountVectorizer.
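A rough sketch of that drop-in replacement (the file-backed get_texts() generator and parameter values are only illustrative); note that HashingVectorizer has no max_features, so dimensionality is controlled with n_features instead, and there is no fit step at all:
from sklearn.feature_extraction.text import HashingVectorizer

def get_texts():
    # placeholder: in reality, yield one document at a time from the database
    for line in open('documents.txt', encoding='utf-8'):
        yield line

# stateless: no vocabulary is built, so memory does not grow with the corpus
vectorizer = HashingVectorizer(n_features=2 ** 18, ngram_range=(1, 3))
X = vectorizer.transform(get_texts())   # one pass over the stream, sparse output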

MemoryError on joblib dump

I have the following snippet running to train a model for text classification. I optimized it quite a bit and it runs pretty smoothly; however, it still uses a lot of RAM. Our dataset is huge (13 million documents plus 18 million words in the vocabulary), but the point in execution where the error is thrown is very odd, in my opinion. The script:
encoder = LabelEncoder()
y = encoder.fit_transform(categories)
classes = list(range(0, len(encoder.classes_)))

vectorizer = CountVectorizer(vocabulary=vocabulary,
                             binary=True,
                             dtype=numpy.int8)
classifier = SGDClassifier(loss='modified_huber',
                           n_jobs=-1,
                           average=True,
                           random_state=1)

tokenpath = modelpath.joinpath("tokens")
for i in range(0, len(batches)):
    token_matrix = joblib.load(
        tokenpath.joinpath("{}.pickle".format(i)))
    batchsize = len(token_matrix)
    classifier.partial_fit(
        vectorizer.transform(token_matrix),
        y[i * batchsize:(i + 1) * batchsize],
        classes=classes
    )

joblib.dump(classifier, modelpath.joinpath('classifier.pickle'))
joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))
joblib.dump(encoder, modelpath.joinpath('category_encoder.pickle'))
joblib.dump(options, modelpath.joinpath('extraction_options.pickle'))
I got the MemoryError at this line:
joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))
At this point in execution, training is finished and the classifier has already been dumped. It should be collected by the garbage collector if more memory is needed. Besides that, why would joblib allocate so much memory if it isn't even compressing the data?
I do not have deep knowledge of the inner workings of the Python garbage collector. Should I be forcing gc.collect() or using 'del' statements to free those objects that are no longer needed?
Update:
I have tried using the HashingVectorizer and, even though it greatly reduces memory usage, vectorizing is much slower, which makes it not a very good alternative.
I have to pickle the vectorizer to later use it in the classification process so I can generate the sparse matrix that is submitted to the classifier. Here is my classification code:
extracted_features = joblib.Parallel(n_jobs=-1)(
    joblib.delayed(features.extractor)(d, extraction_options) for d in documents)

probabilities = classifier.predict_proba(
    vectorizer.transform(extracted_features))
predictions = category_encoder.inverse_transform(
    probabilities.argmax(axis=1))
trust = probabilities.max(axis=1)
If you are providing your custom vocabulary to the CountVectorizer, it should not be a problem to recreate it later on, during classification. As you provide a set of strings instead of a mapping, you probably want to use the parsed vocabulary, which you can access with:
parsed_vocabulary = vectorizer.vocabulary_
joblib.dump(parsed_vocabulary, modelpath.joinpath('vocabulary.pickle'))
and then load it and use it to re-create the CountVectorizer:
vectorizer = CountVectorizer(
    vocabulary=parsed_vocabulary,
    binary=True,
    dtype=numpy.int8
)
Note that you do not need to use joblib here; the standard pickle should perform the same. You might get better results using any of the available alternatives, with PyTables being worth mentioning.
If that uses too much memory as well, you should try using the original vocabulary to recreate the vectorizer; currently, when provided with a set of strings as the vocabulary, vectorizers just convert the set to a sorted list, so you shouldn't need to worry about reproducibility (although I would double-check that before using it in production). Or you could just convert the set to a list on your own.
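For instance, something along these lines (a sketch; the toy vocabulary set and file name stand in for the real ones from your script):
import pickle
import numpy
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = {'apple', 'banana', 'cherry'}   # stands in for your real set of terms

# persist only the raw vocabulary (converted to a sorted list), not the fitted vectorizer
with open('vocabulary.pickle', 'wb') as f:
    pickle.dump(sorted(vocabulary), f)

# later, at classification time, rebuild an equivalent CountVectorizer from that list
with open('vocabulary.pickle', 'rb') as f:
    vocabulary_list = pickle.load(f)

vectorizer = CountVectorizer(vocabulary=vocabulary_list, binary=True, dtype=numpy.int8)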
To sum up: because you do not fit() the vectorizer, the whole added value of using CountVectorizer is its transform() method; since all the data it needs is the vocabulary (and the parameters), you can reduce memory consumption by pickling just your vocabulary, either processed or not.
As you asked for an answer drawing from official sources, I would like to point you to https://github.com/scikit-learn/scikit-learn/issues/3844, where an owner and a contributor of scikit-learn mention recreating a CountVectorizer, albeit for other purposes. You may have better luck reporting your problem in the linked repo, but make sure to include a dataset that causes the excessive memory usage, to make it reproducible.
And finally you may just use HashingVectorizer as mentioned earlier in a comment.
PS: regarding the use of gc.collect() - I would give it a go in this case; regarding the technical details, you will find many questions on SO tackling this issue.

balancing an imbalanced dataset with keras image generator

The Keras ImageDataGenerator can be used to "Generate batches of tensor image data with real-time data augmentation".
The tutorial here demonstrates how a small but balanced dataset can be augmented using the ImageDataGenerator. Is there an easy way to use this generator to augment a heavily unbalanced dataset, such that the resulting, generated dataset is balanced?
This would not be a standard approach to dealing with unbalanced data, nor do I think it would really be justified: you would be significantly changing the distributions of your classes, and the smaller class would now be much less variable. The larger class would have rich variation, while the smaller class would consist of many similar images with small affine transforms; they would live in a much smaller region of image space than the majority class.
The more standard approaches would be:
the class_weight argument of model.fit, which you can use to make the model learn more from the minority class.
reducing the size of the majority class.
accepting the imbalance. Deep learning can cope with this, it just needs lots more data (the solution to everything, really).
The first two options are really kind of hacks, which may harm your ability to cope with real world (imbalanced) data. Neither really solves the problem of low variability, which is inherent in having too little data. If application to a real world dataset after model training isn't a concern and you just want good results on the data you have, then these options are fine (and much easier than making generators for a single class).
The third option is the right way to go if you have enough data (as an example, the recent paper from Google about detecting diabetic retinopathy achieved high accuracy in a dataset where positive cases were between 10% and 30%).
If you truly want to generate a variety of augmented images for one class over another, it would probably be easiest to do it in pre-processing: take the images of the minority class, generate some augmented versions, and just call them all part of your data. Like I say, this is all pretty hacky.
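A rough sketch of that pre-processing idea (the array shapes, augmentation parameters, and output directory are made up, and the directory must exist beforehand):
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

# stand-in for the images of the minority class, shape (n_samples, height, width, channels)
minority_images = np.random.rand(100, 64, 64, 3)

datagen = ImageDataGenerator(rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

# write augmented copies to disk, then treat them as ordinary extra training images
flow = datagen.flow(minority_images, batch_size=32,
                    save_to_dir='augmented_minority',   # directory must already exist
                    save_prefix='aug', save_format='png')
for _ in range(10):   # generate 10 batches' worth of augmented images
    next(flow)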
You can use this strategy to calculate weights based on the imbalance:
from sklearn.utils import class_weight
import numpy as np

class_weights = class_weight.compute_class_weight(
    'balanced',
    np.unique(train_generator.classes),
    train_generator.classes)

train_class_weights = dict(enumerate(class_weights))
model.fit_generator(..., class_weight=train_class_weights)
This answer was inspired by "Is it possible to automatically infer the class_weight from flow_from_directory in Keras?"

How to get feature names while using HashingVectorizer in python?

I want to make a 2D binary array of shape (n_samples, n_features), where each sample is a text string and each feature is a word (unigram).
The problem is that the number of samples is 350,000 and the number of features is 40,000, but my RAM is only 4 GB.
I am getting a memory error when using CountVectorizer. So, is there any other way (like mini-batches) to do this?
If I use HashingVectorizer, how do I get the feature names? I.e., which column corresponds to which feature? The get_feature_names() method is not available in HashingVectorizer.
To get feature names for HashingVectorizer you can take a random sample of documents, compute hashes for them, and learn which hashes correspond to which tokens that way. It is not perfect, because there can be other tokens that map to a given column and there can be collisions, but often this is enough to inspect the vectorization result (or, e.g., the coefficients of a linear classifier that uses hashing features).
A shameless plug - https://github.com/TeamHG-Memex/eli5 package has this implemented:
from eli5.sklearn import InvertableHashingVectorizer
# vec should be a HashingVectorizer instance
ivec = InvertableHashingVectorizer(vec)
ivec.fit(docs_sample) # e.g. each 10-th or 100-th document
names = ivec.get_feature_names()
See also: Debugging Hashing Vectorizer section in eli5 docs.
Mini-batches are not supported in CountVectorizer. However, scikit-learn's HashingVectorizer is stateless, so it can be used in a streaming / mini-batch setting (its partial_fit() method exists but is a no-op, kept for pipeline compatibility).
Quoting sklearn documentation "There is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model."
