GPT-2: How do I speed up/optimize token text generation? - openai-api

I am trying to generate a 20-token text using GPT-2 simple. It takes around 15 seconds to generate the sentence, while AI Dungeon takes around 4 seconds to generate a sentence of the same size.
Is there a way to speed up or otherwise optimize GPT-2 text generation?

I think they get quicker results because their program is better optimized and they have more computing power; they pay a lot for servers. Also, AI Dungeon uses GPT-3, which might simply be faster. I'm struggling with GPT-2's speed as well. Let me know if you figure anything out.
Cheers

Text generation models like GPT-2 are slow, and it is of course even worse with bigger models like GPT-J and GPT-NeoX.
If you want to speed up your text generation you have a couple of options:
Use a GPU. GPT-2 doesn't require much VRAM, so an entry-level GPU will do. On a GPU, generating 20 tokens with GPT-2 shouldn't take more than 1 second (see the sketch at the end of this answer).
Quantize your model and convert it to TensorRT. See this good tutorial: https://github.com/NVIDIA/TensorRT/tree/main/demo/HuggingFace/GPT2
Serve it through a dedicated inference server (like TorchServe or Triton Inference Server).
I actually wrote an article about how to speed up inference of transformer based models. You might find it helpful: how to speed up deep learning inference
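For reference, here is a minimal sketch of GPU generation using the Hugging Face transformers library (an assumption, since the question uses gpt-2-simple); the prompt and sampling settings are arbitrary placeholders:

```python
# Minimal sketch: generate 20 tokens with GPT-2 on a GPU via Hugging Face transformers.
# Assumes `torch` and `transformers` are installed; falls back to CPU if no GPU is found.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

inputs = tokenizer("The knight drew his sword and", return_tensors="pt").to(device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=20,          # the 20 tokens asked about in the question
        do_sample=True,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Once the model is loaded, each completion should run well under a second on a recent GPU; the first call is slower because of the model download and CUDA initialization.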

You can use the OpenVINO optimized version of GPT-2 model. The demo can be found here. It should be much faster as it's heavily optimized.
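If you want to try that route, one possible path is the optimum-intel integration, which can export GPT-2 to OpenVINO IR and run it with the usual generate() API. Treat this as a sketch under the assumption that `optimum[openvino]` is installed; it is not the exact demo referenced above:

```python
# Sketch: run GPT-2 through OpenVINO via the optimum-intel integration.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# export=True converts the original transformers checkpoint to OpenVINO IR on the fly.
model = OVModelForCausalLM.from_pretrained("gpt2", export=True)

inputs = tokenizer("The weather today is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```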

Related

Word2Vec clustering: embed with low dimensionality or with high dimensionality and then reduce?

I am using K-means for topic modelling with Word2Vec vectors, and would like to understand the implications of embedding directly into, say, 10 dimensions versus embedding into 200 dimensions and then using PCA to get down to 10. Does the second approach make sense at all?
Which one worked better for your specific purposes, and your specific data, after trying both and comparing the end results against each other, either in some ad-hoc ("eyeballing") or rigorous way?
There's no reason to prematurely reject either approach, given how many details about your data and ultimate end goals are unstated.
It would be atypical to train a word2vec model to have only 10 dimensions. Published work most often shows the use of 100 to 1000 dimensions, often 300 or 400, assuming you've got enough bulk training data to make the algorithm worthwhile.
(Word2vec needs a lot of varied training text, with many contrasting usage examples for every word of interest, to generate good results. You may occasionally see toy-sized demos, on smaller amounts of data, just to quickly show steps, or some major qualities of the results. But good results, in the aspects for which word2vec is most appreciated, depend on plentiful training data.)
Also, whether or not your aims would be helped by the extra step of PCA to reduce the dimensionality of a larger word2vec model is a separate question. It is best determined experimentally, by comparing results with and without that step on your actual data/problem, rather than guessed at from intuitions formed on other projects that might not be comparable.
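As a concrete way to run that comparison, here is a hedged sketch of both pipelines using gensim and scikit-learn; the toy corpus, cluster count, and Word2Vec parameters are placeholders you would replace with your real data and settings:

```python
# Sketch of the two pipelines: (A) train Word2Vec directly at 10 dimensions,
# (B) train at 200 dimensions and reduce to 10 with PCA, then K-means on each.
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Toy corpus; real word2vec needs far more varied text (see the caveat above).
sentences = [
    ["cats", "chase", "mice"],
    ["dogs", "chase", "cats"],
    ["stocks", "rise", "on", "earnings"],
    ["markets", "fall", "on", "earnings"],
] * 50

# (A) Low-dimensional embedding, clustered directly.
w2v_low = Word2Vec(sentences, vector_size=10, window=3, min_count=1, epochs=20)
labels_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(w2v_low.wv.vectors)

# (B) High-dimensional embedding, reduced with PCA, then clustered.
w2v_high = Word2Vec(sentences, vector_size=200, window=3, min_count=1, epochs=20)
reduced = PCA(n_components=10).fit_transform(w2v_high.wv.vectors)
labels_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)

# Compare the two clusterings however suits your end goal (eyeballing, silhouette, etc.).
print(dict(zip(w2v_high.wv.index_to_key, labels_b)))
```

As the answer says, which variant wins is an empirical question for your data; the sketch just makes the comparison cheap to run.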

How to optimize memory footprint of Stanza models

I'm using Stanza to get tokens, lemmas and tags from documents in multiple languages for the purposes of a language learning app. This means that I need to store and load many Stanza (default) models for different languages.
My main problem right now is that if I want to load all those models the memory requirement is too much for my resources. I currently deploy a web API running Stanza NLP on AWS. I want to keep my infrastructure costs at a minimum.
One possible solution is to load one model at a time when I need to run my script. I guess that means there will be some extra overhead each time in order to load the model in memory.
Another thing I tried is just to use the processors that I really need which decreases the memory footprint but not by that much.
I tried looking at open and closed issues on Github and Google but didn't find much.
What other possible solutions are out there?
The bottom line is that a model for a language has to be in memory during execution, so one way or another you need to make the models smaller or tolerate loading them from disk on demand. I can offer some suggestions to make the models smaller, though be warned that making your model smaller will probably result in poorer accuracy.
You could examine the percentage breakdown of language requests, and store commonly requested languages in memory and only go to disk for rarer language requests.
The strategy with the most immediate impact on model size is to shrink the vocabulary. It is possible you could cut the vocabulary even smaller and still get similar accuracy. We have done some optimization on this front, but there may be more opportunity to cut model size.
You could experiment with smaller model and word-embedding sizes and may only see a small accuracy drop; we haven't aggressively experimented with different model sizes to see how much accuracy is lost. This would mean retraining the model with smaller embedding-size and model-size parameters.
I don't know a lot about this, but there is a strategy of tagging a bunch of data with your big accurate model, and then training a smaller model to mimic the big model. I believe this is called "knowledge distillation".
In a similar direction, you could tag a bunch of data with Stanza, and then train a CoreNLP model (which I think would have a smaller memory footprint).
In summary, I think the easiest thing to do would be to retrain a model with a smaller vocabulary size. I think it currently has 250,000 words, and cutting that to 10,000 or 50,000 will reduce model size but may not affect accuracy too badly.
Unfortunately, I don't think there is a magical option you can select that will just solve this issue; you will have to retrain models and see what kind of accuracy you are willing to sacrifice for a lower memory footprint.
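To illustrate the "keep only commonly requested languages in memory and load the rest on demand" idea from above, here is a hedged sketch using a small LRU cache of pipelines restricted to the processors the app needs; the cache size and processor list are assumptions to tune against your memory budget:

```python
# Sketch: load Stanza pipelines lazily and keep only a few resident in memory.
from functools import lru_cache
import stanza


@lru_cache(maxsize=3)  # keep at most 3 language models in memory at once
def get_pipeline(lang: str) -> stanza.Pipeline:
    # Only request the processors the app actually uses; some languages may also
    # need the 'mwt' processor. stanza.download(lang) must have been run beforehand.
    return stanza.Pipeline(lang=lang, processors="tokenize,pos,lemma", verbose=False)


def analyze(text: str, lang: str):
    doc = get_pipeline(lang)(text)
    return [(w.text, w.lemma, w.upos) for s in doc.sentences for w in s.words]


print(analyze("The cats are sleeping.", "en"))
```

The trade-off is exactly the one described in the question: requests for an evicted language pay the model-loading overhead again.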

How large of testing sample size is sufficient for testing a new method?

I have developed a new image segmentation technique, and now I want to evaluate its performance. I am wondering how many samples I need to perform the evaluation. In other words, how large a testing sample set is sufficient to evaluate the new method? Is there any theory to back this up? Thanks.
There are standard computer vision datasets to benchmark segmentation. Example: http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/
You would have to report how your algorithm performs on these. Considering that the number of all possible images in the world is pretty big, these would constitute a good sample. ;-)

News Article Categorization (Subject / Entity Analysis via NLP?); Preferably in Node.js

Objective: a node.js function that can be passed a news article (title, text, tags, etc.) and will return a category for that article ("Technology", "Fashion", "Food", etc.)
I'm not picky about exactly what categories are returned, as long as the list of possible results is finite and reasonable (10-50).
There are web APIs that do this (e.g., Alchemy), but I'd prefer not to incur the extra cost (both in terms of external HTTP requests and also $$) if possible.
I've had a look at the node module "natural". I'm a bit new to NLP, but it seems like maybe I could achieve this by training a BayesClassifier on a reasonable word list. Does this seem like a good/logical approach? Can you think of anything better?
I don't know if you are still looking for an answer, but let me put in my two cents for anyone who happens to come back to this question.
Having worked in NLP, I would suggest you look into the following approach to solve the problem.
Don't look for a single-package solution. There are great packages out there, no doubt, for lots of things. But when it comes to active research areas like NLP, ML and optimization, the tools tend to be at least 3 or 4 iterations behind what's in academia.
Coming to the core problem: what you want to achieve is text classification.
The simplest way to achieve this would be an SVM multiclass classifier.
Simplest yes, but also with very very (see the double stress) reasonable classification accuracy, runtime performance and ease of use.
The thing you would need to work on is the feature set used to represent your news article/text/tags. You could use a bag-of-words model and add named entities as additional features. You could also use article location/time as features (though for a simple category classification this might not give you much improvement).
The bottom line is: SVMs work great, they have multiple implementations, and at runtime you don't really need much ML machinery.
Feature engineering, on the other hand, is very task-specific. But given a basic set of features and good labelled data, you can train a very decent classifier.
Here are some resources for you.
http://svmlight.joachims.org/
SVM multiclass is what you would be interested in.
And here is a tutorial by SVM zen himself!
http://www.cs.cornell.edu/People/tj/publications/joachims_98a.pdf
I don't know about the stability of this, but from the code it's a binary-classifier SVM, which means that if you have a known set of N tags you want to classify the text into, you will have to train N binary SVM classifiers, one for each of the N category tags.
Hope this helps.
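The question asks for Node.js, but purely to illustrate the bag-of-words + linear SVM pipeline described above, here is a minimal scikit-learn sketch; the categories and training texts are toy placeholders:

```python
# Sketch: bag-of-words (TF-IDF) features + linear SVM for article categorization.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_texts = [
    "New smartphone released with faster chip",
    "Designer unveils spring collection on the runway",
    "Restaurant serves seasonal tasting menu",
]
train_labels = ["Technology", "Fashion", "Food"]

# LinearSVC handles multiclass one-vs-rest internally, i.e. one binary
# classifier per category, as described in the answer above.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_texts, train_labels)

print(clf.predict(["A chef opens a new bistro downtown"]))  # expect "Food"
```

In practice you would want hundreds of labelled articles per category; the toy data here only shows the shape of the pipeline.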

Computational Learning theory based on PAC-learning framework

Consider a machine learning algorithm that trains on a training set. With the help of the PAC learning model, we get bounds on the training sample size needed so that the probability that the error is limited (by epsilon) is itself bounded (by delta).
What does the PAC learning model say about computational (time) complexity?
Suppose a learning algorithm is given more time (like more iterations): how do the error and the probability that the error is bounded change?
A learning algorithm that takes one hour to train is of no practical use in financial prediction problems. I need to know how the performance changes as the time given to the algorithm changes, both in terms of the error bound and the probability that the error is bounded.
The PAC model simply tells you how many pieces of data you need to get a certain level of error with some probability. This can be translated into an impact on run time by looking at the actual machine learning algorithm you're using.
For example, if your algorithm runs in time O(2^n), and the PAC model says you need 1,000 examples to have a 95% chance of having 0.05 error and 10,000 examples for 0.005 error, then you know you should expect a HUGE slowdown for the increased accuracy. Whereas the same PAC information for an O(log n) algorithm would probably lead you to go ahead and get the lower error.
On a side note, it sounds like you might be confused about how most supervised learning algorithms work:
Suppose a learning algorithm is given more time (like more iterations): how do the error and the probability that the error is bounded change?
In most cases you can't really just give the same algorithm more time and expect better results, unless you change the parameters (e.g. learning rate) or increase the number of examples. Perhaps by 'iterations' you meant examples, in which case the impact of the number of examples on the probability and error rate can be found by manipulating the system of equations used for the PAC learning model; see the wiki article.
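To make the "manipulate the PAC equations" point concrete, here is a hedged sketch of the standard sample-complexity bound for a finite, realizable hypothesis class, m >= (1/epsilon) * (ln|H| + ln(1/delta)); the hypothesis-class size used is an arbitrary illustration:

```python
# Sketch: PAC sample complexity for a finite, realizable hypothesis class.
import math


def pac_sample_size(epsilon: float, delta: float, hypothesis_count: float) -> int:
    """Examples sufficient for error <= epsilon with probability >= 1 - delta."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)


H = 2 ** 20  # illustrative hypothesis-class size
print(pac_sample_size(0.05, 0.05, H))   # looser error target -> fewer examples
print(pac_sample_size(0.005, 0.05, H))  # tighter error target -> roughly 10x more
```

Plugging those sample sizes into your algorithm's runtime (O(2^n) vs. O(log n) in the example above) is what tells you whether the extra accuracy is affordable.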
