How to optimize memory footprint of Stanza models - nlp

I'm using Stanza to get tokens, lemmas and tags from documents in multiple languages for the purposes of a language learning app. This means that I need to store and load many Stanza (default) models for different languages.
My main problem right now is that if I want to load all those models the memory requirement is too much for my resources. I currently deploy a web API running Stanza NLP on AWS. I want to keep my infrastructure costs at a minimum.
One possible solution is to load one model at a time when I need to run my script. I guess that means there will be some extra overhead each time in order to load the model in memory.
Another thing I tried is just to use the processors that I really need which decreases the memory footprint but not by that much.
I tried looking at open and closed issues on Github and Google but didn't find much.
What other possible solutions are out there?

The bottom line is a model for a language has to be in memory during execution, so by some means or another you need to make the model smaller or tolerate storing models on disk. I can offer some suggestions to make the models smaller, though be warned that making your model smaller will probably result in poorer accuracy.
You could examine the percentage breakdown of language requests, and store commonly requested languages in memory and only go to disk for rarer language requests.
The most immediate impact strategy for reducing model size is to shrink the vocabulary size. It is possible you could cut the vocabulary even smaller and still get similar accuracy. We have done some optimization on this front, but there may be more opportunity to cut model size.
You could experiment with smaller model size and word embeddings and may only get a small accuracy drop, we haven't really aggressively experimented with different model sizes to see how much accuracy you lose. This would mean retraining the model and just setting the embedding size and model size parameters smaller.
I don't know a lot about this, but there is a strategy of tagging a bunch of data with your big accurate model, and then training a smaller model to mimic the big model. I believe this is called "knowledge distillation".
In a similar direction, you could tag a bunch of data with Stanza, and then train a CoreNLP model (which I think would have a smaller memory footprint).
In summary, I think the easiest thing to do would be to retrain a model with a smaller vocabulary size. We I think it currently has 250,000 words, and cutting to 10,000 or 50,000 will reduce model size, but may not affect accuracy too badly.
Unfortunately I don't think there is a magical option you can select that will just solve this issue, you will have to retrain models and see what kind of accuracy you are willing to sacrifice for a lower memory footprint.

Related

Word2Vec clustering: embed with low dimensionality or with high dimensionality and then reduce?

I am using K-means for topic modelling using Word2Vec and would like to understand the implications of vectorizing up to, let's say, 10 dimensions, against embedding it with 200 dimensions and then using PCA to get down to 10. Does the second approach make sense at all?
Which one worked better for your specific purposes, & your specific data, after trying both & comparing the end-results against each other, either in some ad-hoc ("eyeballing") or rigorous way?
There's no reason to prematurely reject any approach, given how many details about your data & ultimate end-goals are unstated.
It would be atypical to train a word2vec model to have only 10 dimensions. Published work most often shows the use of 100 to 1000 dimensions, often 300 or 400, assuming you've got enough bulk training data to make the algorithm worthwhile.
(Word2vec needs a lot of varied training text, with many contrasting usage examples for every word of interest, to generate good results. You may occasionally see toy-sized demos, on smaller amounts of data, just to quickly show steps, or some major qualities of the results. But good results, in the aspects for which word2vec is most appreciated, depend on plentiful training data.)
Also, whether or not your aims would be helped by the extra step of PCA to reduce the dimensionality of a larger word2vec model seems another separable question, to be determined experimentally by comparing results with and without that step, on your actual data/problem, rather than guessed at from intuitions from other projects that might not be comparable.

Loading a model checkpoint in lesser amount of memory

I had a question that I can't find any answers to online. I have trained a model whose checkpoint file is about 20 GB. Since I do not have enough RAM with my system (or Colaboratory/Kaggle either - the limit being 16 GB), I can't use my model for predictions.
I know that the model has to be loaded into memory for the inferencing to work. However, is there a workaround or a method that can:
Save some memory and be able to load it in 16 GB of RAM (for CPU), or the memory in the TPU/GPU
Can use any framework (since I would be working with both) TensorFlow + Keras, or PyTorch (which I am using right now)
Is such a method even possible to do in either of these libraries? One of my tentative solutions was not load it in chunks perhaps, essentially maintaining a buffer for the model weights and biases and performing calculations accordingly - though I haven't found any implementations for that.
I would also like to add that I wouldn't mind the performance slowdown since it is to be expected with low-specification hardware. As long as it doesn't take more than two weeks :) I can definitely wait that long...
Yoy can try the following:
split model by two parts
load weights to the both parts separately calling model.load_weights(by_name=True)
call the first model with your input
call the second model with the output of the first model

How to use a trained deeplearning model at different resolutions?

I have trained a model for image segmentation task on 320x240x3 resolution images using tensorflow 2.x. I am wondering if there is a way to use the same model or tweak the model to make it work on different resolutions?
I have to use a model trained on a 320x240 resolution for Full HD (1920x1080) and SD(1280x720) images but as the GPU Memory is not sufficient to train the model at the specified resolutions with my architecture, I have trained it on 320x240 images.
I am looking for a scalable solution that works at all the resolutions. Any Suggestions?
The answer to your question is no: you cannot use a model trained at a particular resolution to be used at different resolution; in essence, this is why we train the models at different resolutions, to check the performance and possibly improve it.
The suggestion below omits one crucial aspect: that, depending on the task at hand, increasing the resolution can considerably improve the results in object detection and image segmentation, particularly if you have small objects.
The only solution for your problem, considering the GPU memory constraint, is to try to split the initial image into smaller parts (or maybe tiles) and train per part(say 320x240) and then reconstruct the initial image; otherwise, there is no other solution than to increase the GPU memory in order to train at higher resolutions.
PS: I understood your question after reading it a couple of times; I suggest that you modify a little bit the details w.r.t the resolution.
YEAH, you can do it in high resolution image. But the small resolution is easy to train and it is easy for the model to find the features of the image. Training in small resolution models saves your time and makes your model faster since it has the less number of parameters. HD images contains large amount of pixels, so if you train your model in higher resolution images, it makes your training and model slower as it contains large number of parameters due to the presence of higher number of pixels and it makes difficult for your model to find features in the high resolution image. So, mostly your are advisable to use lower resolution instead of higher resolution.

Deep learning on massive datasets

Theoretical question here. I understand that when dealing with datasets that cannot fit into memory on a single machine, spark + EMR is a great way to go.
However, I would also like to use tensorflow instead of spark's ml lib algorithms to perform deep learning on these large datasets.
From my research I see that I could potentially use a combination of pyspark, elephas and EMR to achieve this. Alternatively there is BigDL and sparkdl.
Am I going about this the wrong way? What is best practice for deep learning on data that cannot fit into memory? Should I use online learning or batch training instead? This post seems to say that "most high-performance deep learning implementations are single-node only"
Any help to point me in the right direction would be greatly appreciated.
In TensorFlow, you can use tf.data.Dataset.from_generator so you can generate your dataset at runtime without any storage hassles.
See link for example https://www.codespeedy.com/what-is-tf-data-dataset-from_generator-in-tensorflow/
As you mention "fitting massive dataset to memory", I understand that you are trying to load all data to memory at once and start training. Hence, I give the reply based on this assumption.
General mentality is that if you cannot fit the data to your resources, divide data into smaller chunks and train in an iterative way.
1- Load data one by one instead of trying to load all at once. If you create an execution workflow as "Load Data -> Train -> Release Data (This can be done automatically by garbage collectors) -> Restart" , you can understand how much resource is needed to train single data.
2- Use mini-batches. As soon as you get the resource information from #1, you can make an easy calculation to estimate the mini-batch size. For example, if training single data consumes 1.5 GB of RAM, and your GPU has 8 GB of RAM, theoretically you may train mini-batches with size 5 at once.
3- If the resources are not enough to train even 1-sized single batch, in this case, you may think about increasing your PC capacity or decreasing your model capacity / layers / features. Alternatively, you can go for cloud computing solutions.

What is an appropriate training set size for sentiment analysis?

I'm looking to use some tweets about measles/ the mmr vaccine to see how sentiment about vaccination changes over time. I plan on creating the training set from the corpus of data I currently have (unless someone has a recommendation on where I can get similar data).
I would like to classify a tweet as either: Pro-vaccine, Anti-Vaccine, or Neither (these would be factual tweets about outbreaks).
So the question is: How big is big enough? I want to avoid problems of overfitting (so I'll do a test train split) but as I include more and more tweets, the number of features needing to be learned increases dramatically.
I was thinking 1000 tweets (333 of each). Any input is appreciated here, and if you could recommend some resources, that would be great too.
More is always better. 1000 tweets on a 3-way split seems quite ambitious, I would even consider 1000 per class for a 3-way split on tweets quite low. Label as many as you can within a feasible amount of time.
Also, it might be worth taking a cascaded approach (esp. with so little data), i.e. label a set vaccine vs non-vaccine, and within the vaccine subset you'd have a pro vs anti set.
In my experience trying to model a catch-all "neutral" class, that contains everything that is not explicitly "pro" or "anti" is quite difficult because there is so much noise. Especially with simpler models such as Naive Bayes, I have found the cascaded approach to be working quite well.

Resources