How do scikit-learn models handle large data sets in Python? - scikit-learn

I have a 10 GB data set to train a model on in sklearn, but my computer only has 8 GB of memory. Do I have other options besides an incremental classifier?

I think sklearn can be used for larger data if the technique is right. If your chosen algorithms support partial_fit or an online learning approach, then you're on track. The chunk size may influence your success.
This link may be useful: (Working with big data in python and numpy, not enough ram, how to save partial results on the disc?)
Another thing you can do is to randomly pick whether or not to keep each row in your csv file, and save the result to a .npy file so it loads more quickly. That way you get a sample of your data that lets you start experimenting with all algorithms, and deal with the bigger-data issue along the way (or not at all! Sometimes a sample with a good approach is good enough, depending on what you want).
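As an illustration of the chunked partial_fit approach described above, here is a minimal sketch. The file name, column layout, chunk size, and choice of SGDClassifier are assumptions for the example; any estimator that exposes partial_fit would work.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier  # supports partial_fit

CSV_PATH = "big_dataset.csv"   # hypothetical 10 GB file with feature columns plus a "label" column
CHUNK_SIZE = 100_000           # rows per chunk; tune to your RAM
CLASSES = np.array([0, 1])     # all class labels must be declared on the first partial_fit call

clf = SGDClassifier()
first_chunk = True

# Stream the CSV in chunks so only CHUNK_SIZE rows are in memory at a time.
for chunk in pd.read_csv(CSV_PATH, chunksize=CHUNK_SIZE):
    y = chunk["label"].to_numpy()
    X = chunk.drop(columns=["label"]).to_numpy()
    if first_chunk:
        clf.partial_fit(X, y, classes=CLASSES)
        first_chunk = False
    else:
        clf.partial_fit(X, y)
```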

Related

How to optimize memory footprint of Stanza models

I'm using Stanza to get tokens, lemmas and tags from documents in multiple languages for the purposes of a language learning app. This means that I need to store and load many Stanza (default) models for different languages.
My main problem right now is that if I want to load all those models the memory requirement is too much for my resources. I currently deploy a web API running Stanza NLP on AWS. I want to keep my infrastructure costs at a minimum.
One possible solution is to load one model at a time when I need to run my script. I guess that means there will be some extra overhead each time in order to load the model in memory.
Another thing I tried is just to use the processors that I really need which decreases the memory footprint but not by that much.
I tried looking at open and closed issues on Github and Google but didn't find much.
What other possible solutions are out there?
The bottom line is that a model for a language has to be in memory during execution, so one way or another you need to make the models smaller or tolerate storing them on disk. I can offer some suggestions to make the models smaller, though be warned that making a model smaller will probably result in poorer accuracy.
You could examine the percentage breakdown of language requests, and store commonly requested languages in memory and only go to disk for rarer language requests.
The strategy with the most immediate impact on model size is to shrink the vocabulary. It is possible you could cut the vocabulary even smaller and still get similar accuracy. We have done some optimization on this front, but there may be more opportunity to cut model size.
You could experiment with smaller model sizes and word embeddings and may only see a small accuracy drop; we haven't aggressively experimented with different model sizes to see how much accuracy is lost. This would mean retraining the model and simply setting the embedding size and model size parameters smaller.
I don't know a lot about this, but there is a strategy of tagging a bunch of data with your big accurate model, and then training a smaller model to mimic the big model. I believe this is called "knowledge distillation".
In a similar direction, you could tag a bunch of data with Stanza, and then train a CoreNLP model (which I think would have a smaller memory footprint).
In summary, I think the easiest thing to do would be to retrain a model with a smaller vocabulary size. I think it currently has 250,000 words, and cutting that to 10,000 or 50,000 will reduce model size but may not affect accuracy too badly.
Unfortunately, I don't think there is a magical option you can select that will just solve this issue; you will have to retrain models and see what kind of accuracy you are willing to sacrifice for a lower memory footprint.
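As a rough sketch of the caching idea mentioned above (keep commonly requested languages resident and load rarer ones on demand), something like the following could work; the cache size and processor list are assumptions you would tune to your traffic and memory budget.

```python
from functools import lru_cache

import stanza

@lru_cache(maxsize=3)  # keep at most 3 language pipelines in memory at once (assumed budget)
def get_pipeline(lang: str) -> stanza.Pipeline:
    # Loading only the processors you actually need keeps the footprint smaller.
    return stanza.Pipeline(lang=lang, processors="tokenize,pos,lemma")

# Frequently requested languages stay cached; a rarer language evicts the
# least recently used pipeline and is reloaded from disk when needed.
doc = get_pipeline("en")("The quick brown fox jumps over the lazy dog.")
print([(w.text, w.lemma, w.upos) for s in doc.sentences for w in s.words])
```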

pytorch dataset map-style vs iterable-style

A map-style dataset in PyTorch implements the __getitem__() and __len__() protocols, while an iterable-style dataset implements the __iter__() protocol. If we use map-style, we can access the data with dataset[idx], which is great; with an iterable-style dataset we can't.
My question is why this distinction was necessary. What makes random reads of the data so expensive, or even impractical?
I wrote a short post on how to use PyTorch datasets, and the difference between map-style and iterable-style datasets.
In essence, you should use map-style datasets when possible. Map-style datasets give you their size ahead of time, are easier to shuffle, and allow for easy parallel loading.
It's a common misconception that if your data doesn't fit in memory, you have to use an iterable-style dataset. That is not true. You can implement a map-style dataset that retrieves data as needed.
Check out the full post here.
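To illustrate that last point, here is a minimal sketch of a map-style dataset that loads each sample from disk on demand rather than holding everything in memory; the on-disk layout (one .pt file per sample, stored as a dict) is just an assumption for the example.

```python
import os
import torch
from torch.utils.data import Dataset, DataLoader

class LazyFileDataset(Dataset):
    """Map-style dataset: knows its length, but reads samples lazily."""

    def __init__(self, root_dir: str):
        self.paths = sorted(
            os.path.join(root_dir, f) for f in os.listdir(root_dir) if f.endswith(".pt")
        )

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int):
        # Only this one sample is loaded into memory.
        sample = torch.load(self.paths[idx])  # assumed to be a dict {"x": ..., "y": ...}
        return sample["x"], sample["y"]

# Shuffling and parallel loading come for free with a map-style dataset.
loader = DataLoader(LazyFileDataset("data/train"), batch_size=32, shuffle=True, num_workers=4)
```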
It's quite possible that the full dataset doesn't fit in memory (could be on a disk, or only accessible over a network). A stream of information doesn't have to be retained if you're not going to access arbitrary offsets. If you're going to request data[0], then data[1], then data[2] over a network, you're sending a lot of requests which introduces latency.
Iterable-like (ResultSet) objects are typical when incrementally reading rows in the results of a database query. It's also conceivable that a dataset could inherently be a stream of information, like logging data, or transactions, or incrementally discovered pages found by a web crawler.
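For the streaming case, an iterable-style dataset simply yields examples as they arrive. This sketch reads a hypothetical log file line by line, never knowing its length and never supporting random access; the toy featurisation is an assumption for the example.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class LogLineDataset(IterableDataset):
    """Iterable-style dataset: a forward-only stream, no __len__ or indexing."""

    def __init__(self, path: str):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:  # could just as well be a database cursor or a network stream
                yield self.encode(line)

    @staticmethod
    def encode(line: str) -> torch.Tensor:
        # Toy featurisation: byte values of the first 32 characters, zero-padded.
        data = line.encode()[:32].ljust(32, b"\0")
        return torch.tensor(list(data), dtype=torch.float32)

loader = DataLoader(LogLineDataset("server.log"), batch_size=16)
```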

Deep learning on massive datasets

Theoretical question here. I understand that when dealing with datasets that cannot fit into memory on a single machine, spark + EMR is a great way to go.
However, I would also like to use TensorFlow instead of Spark's MLlib algorithms to perform deep learning on these large datasets.
From my research I see that I could potentially use a combination of PySpark, Elephas and EMR to achieve this. Alternatively, there are BigDL and sparkdl.
Am I going about this the wrong way? What is best practice for deep learning on data that cannot fit into memory? Should I use online learning or batch training instead? This post seems to say that "most high-performance deep learning implementations are single-node only".
Any help to point me in the right direction would be greatly appreciated.
In TensorFlow, you can use tf.data.Dataset.from_generator to generate your dataset at runtime without any storage hassles.
See this link for an example: https://www.codespeedy.com/what-is-tf-data-dataset-from_generator-in-tensorflow/
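A minimal sketch of that approach, assuming a generator that produces one sample at a time (the feature size, placeholder data, and batch size are made up for the example; in practice the generator would read from disk or a database):

```python
import numpy as np
import tensorflow as tf

def sample_generator():
    # Yield one (features, label) pair at a time so the full dataset
    # never has to sit in memory.
    for _ in range(1000):                        # placeholder for "iterate over your storage"
        x = np.random.rand(32).astype("float32")  # assumed feature vector of length 32
        y = np.random.randint(0, 2)
        yield x, y

dataset = (
    tf.data.Dataset.from_generator(
        sample_generator,
        output_signature=(
            tf.TensorSpec(shape=(32,), dtype=tf.float32),
            tf.TensorSpec(shape=(), dtype=tf.int32),
        ),
    )
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)

# `dataset` can now be passed directly to model.fit(dataset, epochs=...).
```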
Since you mention "fitting a massive dataset to memory", I understand that you are trying to load all the data into memory at once and start training, so my reply is based on that assumption.
The general mentality is that if you cannot fit the data into your resources, divide the data into smaller chunks and train in an iterative way (a rough sketch of this workflow follows the steps below).
1- Load data one sample at a time instead of trying to load it all at once. If you create an execution workflow such as "Load data -> Train -> Release data (this can be done automatically by the garbage collector) -> Repeat", you can measure how much memory is needed to train on a single sample.
2- Use mini-batches. As soon as you get the resource information from #1, you can make an easy calculation to estimate the mini-batch size. For example, if training on a single sample consumes 1.5 GB of RAM and your GPU has 8 GB of RAM, you may theoretically train on mini-batches of size 5 at a time.
3- If the resources are not enough to train on even a batch of size 1, you may think about increasing your PC's capacity or decreasing your model's capacity (layers/features). Alternatively, you can go for cloud computing solutions.
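Here is a rough sketch of that chunked workflow, assuming the data has already been split into per-chunk .npy files on disk and that a small tf.keras model is being trained (the file layout, feature size, and model are all assumptions for the example):

```python
import glob
import numpy as np
import tensorflow as tf

# Hypothetical layout: chunks/chunk_000_x.npy, chunks/chunk_000_y.npy, chunk_001_x.npy, ...
x_files = sorted(glob.glob("chunks/chunk_*_x.npy"))
y_files = sorted(glob.glob("chunks/chunk_*_y.npy"))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),  # assumed 32 features
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

for epoch in range(5):
    for x_path, y_path in zip(x_files, y_files):
        x = np.load(x_path)   # load one chunk -> train -> release
        y = np.load(y_path)
        model.fit(x, y, batch_size=64, epochs=1, verbose=0)
        del x, y              # let the garbage collector reclaim the chunk
```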

What is an appropriate training set size for sentiment analysis?

I'm looking to use some tweets about measles/the MMR vaccine to see how sentiment about vaccination changes over time. I plan on creating the training set from the corpus of data I currently have (unless someone has a recommendation on where I can get similar data).
I would like to classify a tweet as either: Pro-vaccine, Anti-Vaccine, or Neither (these would be factual tweets about outbreaks).
So the question is: how big is big enough? I want to avoid problems of overfitting (so I'll do a train/test split), but as I include more and more tweets, the number of features to be learned increases dramatically.
I was thinking 1000 tweets (333 of each). Any input is appreciated here, and if you could recommend some resources, that would be great too.
More is always better. 1000 tweets for a 3-way split seems quite ambitious; I would consider even 1000 per class quite low for a 3-way split on tweets. Label as many as you can within a feasible amount of time.
Also, it might be worth taking a cascaded approach (especially with so little data), i.e. label a set as vaccine vs. non-vaccine, and within the vaccine subset label a pro vs. anti set.
In my experience, trying to model a catch-all "neutral" class that contains everything that is not explicitly "pro" or "anti" is quite difficult because there is so much noise. Especially with simpler models such as Naive Bayes, I have found the cascaded approach to work quite well.
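A minimal sketch of that cascaded setup with scikit-learn and Naive Bayes; the label names, TF-IDF features, and toy tweets are placeholders standing in for your own labelled data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy placeholder data; in practice these come from your labelled tweets.
texts = [
    "Vaccines cause more harm than good",       # anti
    "Get your MMR shot, it saves lives",        # pro
    "Measles outbreak reported in the county",  # neither
    "So glad I vaccinated my kids",             # pro
]
is_vaccine = ["vaccine", "vaccine", "other", "vaccine"]  # stage-1 labels
stance = ["anti", "pro", "pro"]                          # stage-2 labels for the vaccine subset

# Stage 1: vaccine-related vs. everything else.
stage1 = make_pipeline(TfidfVectorizer(), MultinomialNB())
stage1.fit(texts, is_vaccine)

# Stage 2: trained only on the vaccine-related subset, pro vs. anti.
vaccine_texts = [t for t, lab in zip(texts, is_vaccine) if lab == "vaccine"]
stage2 = make_pipeline(TfidfVectorizer(), MultinomialNB())
stage2.fit(vaccine_texts, stance)

def classify(tweet: str) -> str:
    if stage1.predict([tweet])[0] != "vaccine":
        return "neither"
    return stage2.predict([tweet])[0]

print(classify("Another measles case confirmed today"))
```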

Best practice for training on large scale datasets like ImageNet using Theano/Lasagne?

I found that all of the examples for Theano/Lasagne deal with small data sets like MNIST and CIFAR-10, which can be loaded into memory completely.
My question is: how do I write efficient code for training on large-scale datasets?
Specifically, what is the best way to prepare mini-batches (including real time data augmentation) in order to keep the GPU busy?
Maybe like using CAFFE's ImageDataLayer?
For example, I have a big txt file which contains all the image paths and labels.
It would be appreciated if you could show some code.
Thank you very much!
If the data doesn't fit into memory, a good approach is to prepare the minibatches and store them in an HDF5 file, which is then used at training time.
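A minimal sketch of that HDF5 approach with h5py (the array shapes, batch size, and placeholder arrays are assumptions): minibatches are written once during preprocessing, then read back slice by slice during training so only one batch is in memory at a time.

```python
import h5py
import numpy as np

BATCH = 128
N_BATCHES = 10  # kept small so the sketch runs quickly

# One-off preprocessing: write minibatches into a single HDF5 file.
with h5py.File("train_batches.h5", "w") as f:
    x_ds = f.create_dataset("x", shape=(N_BATCHES * BATCH, 3, 32, 32), dtype="float32")
    y_ds = f.create_dataset("y", shape=(N_BATCHES * BATCH,), dtype="int64")
    for i in range(N_BATCHES):
        sl = slice(i * BATCH, (i + 1) * BATCH)
        x_ds[sl] = np.random.rand(BATCH, 3, 32, 32)      # placeholder preprocessed images
        y_ds[sl] = np.random.randint(0, 10, size=BATCH)  # placeholder labels

# At training time, read one minibatch at a time.
with h5py.File("train_batches.h5", "r") as f:
    for i in range(N_BATCHES):
        sl = slice(i * BATCH, (i + 1) * BATCH)
        x, y = f["x"][sl], f["y"][sl]
        # train_fn(x, y)  # your Theano/Lasagne training function goes here
```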
However, this does not suffice when doing data augmentation, as that is done on the fly. Because of Python's global interpreter lock, images cannot be loaded and preprocessed while the GPU is busy.
The best way around this, that I know of, is the Fuel library.
Fuel loads and preprocesses the minibatches in a separate Python process and then streams them to the training process over a TCP socket:
http://fuel.readthedocs.org/en/latest/server.html#data-processing-server
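A rough sketch of that setup, based on the Fuel documentation linked above; the dataset choice (MNIST) and exact argument names are assumptions and may differ between Fuel versions.

```python
# server.py -- run in its own process; streams ready-made minibatches over TCP
from fuel.datasets import MNIST
from fuel.schemes import ShuffledScheme
from fuel.server import start_server
from fuel.streams import DataStream

dataset = MNIST(which_sets=('train',))
stream = DataStream(dataset,
                    iteration_scheme=ShuffledScheme(dataset.num_examples, batch_size=128))
start_server(stream, port=5557)
```

```python
# train.py -- the training process pulls batches from the socket instead of loading data itself
from fuel.streams import ServerDataStream

train_stream = ServerDataStream(('features', 'targets'), produces_examples=False, port=5557)
for x_batch, y_batch in train_stream.get_epoch_iterator():
    pass  # feed x_batch, y_batch to your Lasagne training function here
```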
It additionally provides some functions to preprocess the data, such as scaling and mean subtraction:
http://fuel.readthedocs.org/en/latest/overview.html#transformers-apply-some-transformation-on-the-fly
Hope this helps.
Michael
