I'd like to evaluate how much memory is used up by both
sklearn.ensemble.IsolationForest
sklearn.ensemble.RandomForestClassifier
But
sys.getsizeof(my_isolation_forest_model)
sys.getsizeof(my_random_forest_classifier_model)
always returns a value of 48, no matter how the model is fit.
Can you help me find out how much memory space my models are using?
Scikit-Learn trees are represented using Cython data structures, which do not interact with Python's sys.getsizeof function in a meaningful way - you're just seeing the size of the outer Python object and its pointer(s).
As a workaround, you can dump these Cython data structures into a Pickle file (or any alternative serialization protocol/data format), and then measure the size of this file. The data from the Cython data structures will all be there.
When using Pickle, be sure to disable any data compression!
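For example, a minimal sketch of the pickle-based measurement described above, reusing the model variable names from the question (the serialized size is only an approximation of the in-memory footprint):

import pickle

# Serialize each fitted model without compression and count the bytes.
forest_bytes = len(pickle.dumps(my_isolation_forest_model, protocol=pickle.HIGHEST_PROTOCOL))
print(f"IsolationForest: ~{forest_bytes} bytes")

clf_bytes = len(pickle.dumps(my_random_forest_classifier_model, protocol=pickle.HIGHEST_PROTOCOL))
print(f"RandomForestClassifier: ~{clf_bytes} bytes")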
Why does PyTorch create another repo called TorchData for similar/new Dataset and DataLoader functionality instead of adding them to the existing PyTorch repo? What is the difference between Dataset and DataPipe? Thanks.
TorchData is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines.
It aims to provide composable Iterable-style and Map-style building blocks called DataPipes that work well out of the box with PyTorch's DataLoader. It contains functionality to reproduce many different datasets in TorchVision and TorchText, including loading, parsing, caching, and several other utilities (e.g. hash checking).
DataPipe is simply a renaming and repurposing of the PyTorch Dataset for composed usage. A DataPipe takes in some access function over Python data structures, __iter__ for IterDataPipes and __getitem__ for MapDataPipes, and returns a new access function with a slight transformation applied.
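For illustration, composing IterDataPipes and feeding them to the standard DataLoader looks roughly like this (a minimal sketch assuming torchdata is installed; the transform functions are made up for the example):

from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

def double(x):
    return x * 2

def is_multiple_of_three(x):
    return x % 3 == 0

# Each operation wraps the previous DataPipe and returns a new one.
dp = IterableWrapper(range(20))
dp = dp.map(double)
dp = dp.filter(is_multiple_of_three)

# A DataPipe plugs into the regular DataLoader like any IterableDataset.
for batch in DataLoader(dp, batch_size=4):
    print(batch)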
This blog post, introducing OpenAI's new Python extension called Triton, says this about why Triton can do matrix math faster than PyTorch (referring to an example of how Triton can be used to compute softmax along the rows of an m-by-n matrix):
Importantly, this particular implementation of softmax keeps the rows of X in SRAM throughout the entire normalization process, which maximizes data reuse when applicable (~<32K columns). This differs from PyTorch’s internal CUDA code, whose use of temporary memory makes it more general but significantly slower (below). The bottom line here is not that Triton is inherently better, but that it simplifies the development of specialized kernels that can be much faster than those found in general-purpose libraries.
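For reference, the kernel the post describes looks roughly like this (a simplified sketch following the pattern of Triton's public softmax tutorial; the names and the single shared row stride are illustrative simplifications):

import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, row_stride, n_cols, BLOCK_SIZE: tl.constexpr):
    # One program instance handles one full row of X.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    # Load the whole row once; it stays in on-chip memory for every later step.
    x = tl.load(in_ptr + row * row_stride + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)   # numerically stable softmax
    num = tl.exp(x)
    den = tl.sum(num, axis=0)
    # One write back to global memory; no intermediate buffers in DRAM.
    tl.store(out_ptr + row * row_stride + cols, num / den, mask=mask)

The launch grid would be one program per row, e.g. softmax_kernel[(n_rows,)](y, x, x.stride(0), n_cols, BLOCK_SIZE=triton.next_power_of_2(n_cols)).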
How does PyTorch allocate memory for device tensors, and what is the "temporary memory" being referred to here? Why is the use of this temporary memory more general, but slower than the use of SRAM?
Is SRAM here referring to cache memory? If so, how/why does this library make better use of cache memory than PyTorch internals? My understanding is that the decision about what data to cache is mostly up to the hardware rather than the software.
I am working with a very large dataset (1.5 million rows) and thought about using an SVR.
Since there is so much data, I thought about switching to a linear SVM and using the Nystroem method to build an approximate kernel from uniformly sampled data.
However, I would rather construct the kernel via Kernel K-Means, but I have not found an official implementation yet.
This link provides an unofficial method, but it results in a very large model since it is serialized.
https://tslearn.readthedocs.io/en/stable/gen_modules/clustering/tslearn.clustering.KernelKMeans.html
Maybe someone has a clue where to look for this, or how to implement it in code for an arbitrary dataset?
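For comparison, the Nystroem-plus-linear-SVM route mentioned above would look roughly like this in sklearn (a minimal sketch with made-up data and illustrative parameter values):

import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVR

# Stand-in for the real 1.5 million-row dataset.
X = np.random.rand(10_000, 20)
y = np.random.rand(10_000)

model = make_pipeline(
    # Approximate the RBF kernel map from a uniformly sampled set of landmark points.
    Nystroem(kernel="rbf", n_components=300, random_state=0),
    # Fit a linear SVM in the approximated feature space.
    LinearSVR(C=1.0, max_iter=5_000),
)
model.fit(X, y)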
I have a 10GB dataset to train a model on in sklearn, but my computer only has 8GB of memory. Are there other ways to go besides an incremental classifier?
I think sklearn can be used for larger data if the technique is right. If your chosen algorithms support partial_fit or an online learning approach, then you're on track; the chunk size may influence your success.
This link may be useful (Working with big data in python and numpy, not enough ram, how to save partial results on the disc?).
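For the partial_fit route, a minimal sketch of chunked training (assuming a CSV with a numeric "label" column and an estimator that supports online learning; the file name, chunk size, and class list are illustrative):

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1])  # partial_fit needs the full class list on the first call

for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    y = chunk.pop("label").to_numpy()
    X = chunk.to_numpy()
    # Only the current chunk is held in memory at any time.
    clf.partial_fit(X, y, classes=classes)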
Another thing you can do is randomly pick whether or not to keep each row of your csv file... and save the result to a .npy file so it loads quicker. That way you get a sample of your data that will let you start playing with it with all algorithms, and you can deal with the bigger-data issue along the way (or not at all! Sometimes a sample with a good approach is good enough, depending on what you want).
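A minimal sketch of that subsampling idea (the file names and keep probability are made up):

import numpy as np
import pandas as pd

keep_prob = 0.1
kept = []
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    # Randomly decide, row by row, whether to keep each row.
    mask = np.random.rand(len(chunk)) < keep_prob
    kept.append(chunk[mask])

sample = pd.concat(kept, ignore_index=True)
# A .npy file loads much faster than re-parsing the CSV.
np.save("sample.npy", sample.to_numpy())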
I'm working on extracting text features from a large dataset of documents (about 15 million documents) using CountVectorizer. I also looked at HashingVectorizer as an alternative, but I think CountVectorizer is what I need, as it provides more information about text features and other stuff.
The problem here is kinda common: I don't have enough memory when fitting the CountVectorizer model.
from sklearn.feature_extraction.text import CountVectorizer

def getTexts():
    # a generator that yields each document from the database
    ...

vectorizer = CountVectorizer(max_features=500, ngram_range=(1, 3))
X = vectorizer.fit_transform(getTexts())
Here, let's say I have an iterator that yields one document at a time from a database. If I pass this iterator as a parameter to CountVectorizer's fit() function, how is the vocabulary built? Does it wait until all the documents have been loaded and then fit once, or does it load one document, fit on it, and then load the next? What is a possible solution to the memory overhead here?
The reason why CountVectorizer consumes much more memory is that it needs to store a vocabulary dictionary in memory. The HashingVectorizer, by contrast, has better memory performance because it does not need to store a vocabulary dictionary at all. The main difference between these two vectorizers is described in the documentation of HashingVectorizer:
This strategy has several advantages:
it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory
it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):
there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
no IDF weighting as this would render the transformer stateful.
And of course CountVectorizer will load one document at a time, process it, and then load the next one. In this process, however, it keeps growing its vocabulary dictionary, so memory usage keeps surging throughout the fit.
To reduce memory usage, you may need to shrink the document dataset, and setting a lower max_features parameter may also help. However, if you want to resolve this memory problem completely, try using the HashingVectorizer instead of the CountVectorizer.
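For instance, swapping in HashingVectorizer could look like this (a minimal sketch reusing the getTexts() generator from the question; n_features follows the 2 ** 18 example from the quoted docs):

from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: no vocabulary is built, so nothing accumulates in memory during transform.
vectorizer = HashingVectorizer(n_features=2**18, ngram_range=(1, 3))
X = vectorizer.transform(getTexts())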