OpenAI API: Does fine-tuning have a token limit?

The GPT-3 API documentation says: "One limitation to keep in mind is that, for most models, a single API request can only process up to 2,048 tokens (roughly 1,500 words) between your prompt and completion."
The fine-tuning documentation says: "The more training samples you have, the better. We recommend having at least a couple hundred examples. In general, we've found that each doubling of the dataset size leads to a linear increase in model quality."
My question is: does the 1,500-word limit also apply to fine-tuned models? And does "doubling of the dataset size" refer to the number of training examples rather than the size of each training example?

As far as I understand...
GPT-3 models have token limits because you can only provide 1 prompt and you only get 1 completion. Therefore, as stated in the official OpenAI article:
Depending on the model used, requests can use up to 4097 tokens shared
between prompt and completion. If your prompt is 4000 tokens, your
completion can be 97 tokens at most.
Whereas, fine-tuning as such does not have a token limit (i.e., you can have a million training examples, a million prompt-completion pairs) as stated on the official OpenAI website:
The more training examples you have, the better. We recommend having
at least a couple hundred examples. In general, we've found that each
doubling of the dataset size leads to a linear increase in model
quality.
However, each individual prompt-completion pair in the fine-tuning data does have a token limit: the prompt and completion of a single pair should not, together, exceed the model's token limit.
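As a rough illustration of the two points above (the shared prompt/completion budget, and the per-pair limit on fine-tuning data), here is a minimal sketch. The helper names are made up for this example, and the default token counter is only a crude word-based estimate derived from the "2,048 tokens is roughly 1,500 words" figure quoted in the question; for real checks you would pass in the model's actual tokenizer and the limit that applies to your model.

```python
from typing import Callable, List, Sequence, Tuple

# Rough ratio from the quoted documentation: 2,048 tokens ~ 1,500 words.
TOKENS_PER_WORD = 2048 / 1500


def estimate_tokens(text: str) -> int:
    """Crude word-count estimate; swap in a real tokenizer for accuracy."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def max_completion_tokens(context_limit: int, prompt_tokens: int) -> int:
    """Tokens left for the completion once the prompt has been counted."""
    return max(context_limit - prompt_tokens, 0)


def oversized_pairs(
    pairs: Sequence[Tuple[str, str]],
    limit: int,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> List[int]:
    """Indices of prompt-completion pairs whose combined size exceeds the limit."""
    return [
        i
        for i, (prompt, completion) in enumerate(pairs)
        if count_tokens(prompt) + count_tokens(completion) > limit
    ]


# The example from the quoted article: a 4000-token prompt leaves 97 tokens.
print(max_completion_tokens(4097, 4000))  # 97
```

The number of pairs you pass to `oversized_pairs` is unbounded, which mirrors the answer: the dataset can grow freely, but each pair is checked against the limit on its own.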

Related

Word2Vec clustering: embed with low dimensionality or with high dimensionality and then reduce?

I am using K-means for topic modelling using Word2Vec and would like to understand the implications of vectorizing up to, let's say, 10 dimensions, against embedding it with 200 dimensions and then using PCA to get down to 10. Does the second approach make sense at all?
Which one worked better for your specific purposes, & your specific data, after trying both & comparing the end-results against each other, either in some ad-hoc ("eyeballing") or rigorous way?
There's no reason to prematurely reject any approach, given how many details about your data & ultimate end-goals are unstated.
It would be atypical to train a word2vec model to have only 10 dimensions. Published work most often shows the use of 100 to 1000 dimensions, often 300 or 400, assuming you've got enough bulk training data to make the algorithm worthwhile.
(Word2vec needs a lot of varied training text, with many contrasting usage examples for every word of interest, to generate good results. You may occasionally see toy-sized demos, on smaller amounts of data, just to quickly show steps, or some major qualities of the results. But good results, in the aspects for which word2vec is most appreciated, depend on plentiful training data.)
Also, whether or not your aims would be helped by the extra step of PCA to reduce the dimensionality of a larger word2vec model seems another separable question, to be determined experimentally by comparing results with and without that step, on your actual data/problem, rather than guessed at from intuitions from other projects that might not be comparable.
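To make the "train large, then reduce" option concrete, here is a minimal sketch of the PCA step using NumPy only. The input is random placeholder data, not real word vectors; in practice the 200-dimensional matrix would come from your trained model (in gensim, for instance, `model.wv.vectors`), and you would run the same K-means and the same evaluation on both this reduced matrix and a model trained directly at 10 dimensions.

```python
import numpy as np


def pca_reduce(vectors: np.ndarray, n_components: int) -> np.ndarray:
    """Project row vectors onto their top principal components via SVD."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Placeholder data: 500 random "word vectors" with 200 dimensions each.
rng = np.random.default_rng(0)
high_dim = rng.normal(size=(500, 200))

reduced = pca_reduce(high_dim, 10)
print(reduced.shape)  # (500, 10)
```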

PEGASUS pre-training for summarisation tasks

I am unsure of how the evaluation for large document summarisation is conducted for the recently introduced PEGASUS model for single document summarisation.
The authors show evaluation against large-document datasets like BIGPATENT, PubMed, etc., with document lengths exceeding the input size of the transformer models.
To quote from the paper, they did talk about this but didn't really elaborate further.
CNN/DailyMail, Multi-News, arXiv, PubMed, BIGPATENT datasets contain input documents longer than the maximum input length (L_input = 512 tokens) in pre-training. This would present a problem for position embeddings which would never be updated for longer input lengths, but we confirm the postulation that sinusoidal positional encodings (Vaswani et al., 2017) generalize well when fine-tuning PEGASUS_LARGE beyond the input lengths observed in training up to L_input = 1024 tokens. Since average input length in BIGPATENT, arXiv, PubMed and Multi-News are well beyond 1024 tokens, further scaling up L_input or applying a two-stage approach (Liu et al., 2018) may improve performance even more, although this is outside the scope of this work.
They do mention that the input length goes up to 1024 tokens. In the PEGASUS Large model on Hugging Face, the maximum number of input tokens is also 1024.
I am not sure how they managed to summarise documents longer than 1024 tokens.
I would like to do something similar for my own long-document summarisation.
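For context on the quoted passage: sinusoidal positional encodings are a fixed function of the position, not a learned lookup table, so position 1023 has a well-defined encoding even if pre-training never went past 512 tokens. The sketch below follows the formula from Vaswani et al. (2017); the function name and the small `d_model` are chosen for illustration only, and this does not by itself explain how inputs beyond 1024 tokens were handled.

```python
import math
from typing import List


def sinusoidal_encoding(position: int, d_model: int) -> List[float]:
    """Positional encoding from Vaswani et al. (2017): sin on even dims, cos on odd dims."""
    encoding = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


# Nothing is looked up or trained, so any position can be encoded.
print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
later = sinusoidal_encoding(1023, 8)
```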

How to optimize memory footprint of Stanza models

I'm using Stanza to get tokens, lemmas and tags from documents in multiple languages for the purposes of a language learning app. This means that I need to store and load many Stanza (default) models for different languages.
My main problem right now is that if I want to load all those models the memory requirement is too much for my resources. I currently deploy a web API running Stanza NLP on AWS. I want to keep my infrastructure costs at a minimum.
One possible solution is to load one model at a time when I need to run my script. I guess that means there will be some extra overhead each time in order to load the model in memory.
Another thing I tried is loading only the processors that I really need, which decreases the memory footprint, but not by much.
I tried looking through open and closed issues on GitHub and searching Google, but didn't find much.
What other possible solutions are out there?
The bottom line is a model for a language has to be in memory during execution, so by some means or another you need to make the model smaller or tolerate storing models on disk. I can offer some suggestions to make the models smaller, though be warned that making your model smaller will probably result in poorer accuracy.
You could examine the percentage breakdown of language requests, and store commonly requested languages in memory and only go to disk for rarer language requests.
The strategy with the most immediate impact on model size is to shrink the vocabulary. It is possible you could cut the vocabulary down further and still get similar accuracy. We have done some optimization on this front, but there may be more opportunity to cut model size.
You could experiment with a smaller model size and smaller word embeddings, and you may only see a small accuracy drop. We haven't really aggressively experimented with different model sizes to see how much accuracy you lose. This would mean retraining the model with the embedding size and model size parameters set smaller.
I don't know a lot about this, but there is a strategy of tagging a bunch of data with your big accurate model, and then training a smaller model to mimic the big model. I believe this is called "knowledge distillation".
In a similar direction, you could tag a bunch of data with Stanza, and then train a CoreNLP model (which I think would have a smaller memory footprint).
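The "keep common languages in memory, go to disk for rare ones" suggestion can be sketched as a small least-recently-used cache. This is only an illustration: `PipelineCache` and its `loader` argument are names invented here, and the loader is whatever function builds a pipeline for a language code.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class PipelineCache:
    """Keep at most `max_loaded` pipelines in memory, evicting the least recently used."""

    def __init__(self, loader: Callable[[str], Any], max_loaded: int = 2):
        self._loader = loader
        self._max_loaded = max_loaded
        self._loaded: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, lang: str) -> Any:
        if lang in self._loaded:
            self._loaded.move_to_end(lang)  # mark as most recently used
            return self._loaded[lang]
        pipeline = self._loader(lang)  # slow path: load the model from disk
        self._loaded[lang] = pipeline
        if len(self._loaded) > self._max_loaded:
            self._loaded.popitem(last=False)  # drop the least recently used
        return pipeline

    def loaded_languages(self) -> List[str]:
        """Language codes currently held in memory, oldest first."""
        return list(self._loaded)
```

With Stanza, the loader could be something like `lambda lang: stanza.Pipeline(lang=lang, processors="tokenize,pos,lemma")`, assuming those are the processors you need. The trade-off is the one the question already anticipates: a request for an evicted language pays the model loading time again.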
In summary, I think the easiest thing to do would be to retrain a model with a smaller vocabulary size. I think it currently has 250,000 words, and cutting to 10,000 or 50,000 will reduce model size, but may not affect accuracy too badly.
Unfortunately, I don't think there is a magical option you can select that will just solve this issue. You will have to retrain models and see what kind of accuracy you are willing to sacrifice for a lower memory footprint.
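To get a feel for why vocabulary size matters, here is some back-of-the-envelope arithmetic for the word-embedding table alone. The embedding dimension of 100 and the 4 bytes per value (float32) are assumptions for illustration, not confirmed Stanza settings, and the rest of the model adds to the total.

```python
def embedding_megabytes(vocab_size: int, embedding_dim: int, bytes_per_value: int = 4) -> float:
    """Memory for a dense vocab_size x embedding_dim table, in MB (10**6 bytes)."""
    return vocab_size * embedding_dim * bytes_per_value / 1e6


# embedding_dim=100 is an assumed value, used only to compare the vocabulary sizes.
for vocab in (250_000, 50_000, 10_000):
    print(vocab, embedding_megabytes(vocab, 100), "MB")
```

Under these assumptions the table shrinks linearly with the vocabulary, so going from 250,000 to 50,000 words cuts it to a fifth.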

Which method, dm or dbow, works well for document similarity using Doc2Vec?

I'm trying to find the similarity between 2 documents. I'm using Gensim's Doc2Vec to train on around 10k documents. There are around 10 string tags. Each tag is a unique word and covers some group of documents. The model is trained using the distributed memory method.
Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=2, window=10, dm=1, dm_mean=1, epochs=50, seed=25, vector_size=100, workers=1)
I've tried both dm and dbow. dm gives better results (similarity scores) than dbow. I understand the concepts of dm vs dbow, but I don't know which method is better for measuring similarity between two documents.
First question: Which method is the best to perform well on similarities?
model.wv.n_similarity(<words_1>, <words_2>) gives similarity score using word vectors.
model.docvecs.similarity_unseen_docs(model, doc1, doc2) gives a similarity score using doc vectors, where doc1 and doc2 are not tags or indexes of doctags. Each of doc1 and doc2 is a sentence of roughly 10-20 words.
wv.n_similarity and docvecs.similarity_unseen_docs give different similarity scores on the same types of documents.
docvecs.similarity_unseen_docs gives slightly better results than wv.n_similarity, but wv.n_similarity sometimes also gives good results.
Question: What is the difference between docvecs.similarity_unseen_docs and wv.n_similarity? Can I use docvecs.similarity_unseen_docs to find the similarity score between unseen data? (It might be a silly question.)
I ask because I think docvecs.similarity_unseen_docs provides a similarity score on tags, not on the actual words belonging to those tags. I'm not sure; please correct me if I'm wrong.
How can I convert cosine similarity score to probability?
Thanks.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
model = Doc2Vec(alpha=0.025, min_alpha=0.0001, min_count=2, window=10, dm=1, dm_mean=1, epochs=50, seed=25, vector_size=100, workers=4)
# Training of the model: one TaggedDocument per text, tagged with its index
tagged_data = [TaggedDocument(words=_d, tags=[str(i)]) for i, _d in enumerate(<list_of_list_of_tokens>)]
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)
# Finding similarity score from the word vectors
model.wv.n_similarity(<doc_words1>, <doc_words2>)
# Reseed before inference so the inferred doc-vectors are repeatable
model.random.seed(25)
model.docvecs.similarity_unseen_docs(model, <doc_words1>, <doc_words2>)
Both PV-DM mode (dm=1, the default) and PV-DBOW mode (dm=0) can work well. Which is better will depend on your data and goals. Once you have a robust way to quantitatively score the quality of a model's results for your project goals (which you'll want anyway, so that you can tune all of the model's meta-parameters, including DM/DBOW mode), you can and should try both.
PV-DBOW trains fast, and often works very well on shortish-docs (a few dozens of words). Note, though, that this mode doesn't train usable word-vectors unless you also add the dbow_words=1 option, which will slow training.
Using model.wv.n_similarity() relies on word-vectors only. It averages each set of word-vectors, then reports the cosine-similarity between those two averages. (So, it will only be sensible in PV-DM mode, or PV-DBOW with dbow_words=1 activated.)
Using model.docvecs.similarity_unseen_docs() uses infer_vector() to treat each of the supplied docs as new texts, for which a true Doc2Vec doc-vector (not a mere average-of-word-vectors) is calculated. (This method operates on lists-of-words, not lists-of-tags.)
Which is better is something you should test for your goals. The average-of-word-vectors is a simpler, faster technique for making a text-vector – but still works ok for a lot of purposes. The inferred doc-vectors take longer to calculate, but with a good model, may be better for some tasks.
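The average-of-word-vectors calculation described above can be sketched in plain Python. This illustrates the idea only and is not gensim's actual implementation; the `vectors` dictionary stands in for a trained model's word-vectors, and silently skipping unknown words is a choice made for this example.

```python
import math
from typing import Dict, List, Sequence


def mean_vector(words: Sequence[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the known words (unknown words are skipped)."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("none of the words have a vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def average_similarity(
    words_1: Sequence[str], words_2: Sequence[str], vectors: Dict[str, List[float]]
) -> float:
    """Cosine similarity between the two averaged word-vectors."""
    return cosine(mean_vector(words_1, vectors), mean_vector(words_2, vectors))
```

Note that word order and document identity play no role here, which is the main contrast with an inferred doc-vector.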
Other notes on your setup:
often, setting min_count as low as 2 is a bad idea: those rare words don't have enough examples to mean much, and actually interfere with the quality of surrounding words
10k documents is on the smallish side for a training corpus, compared to published Doc2Vec results (which usually use tens-of-thousands to millions of documents).
published results often use 10-20 training epochs (though more, like your choice of 50, might be helpful especially for smaller corpuses)
on typical multi-core machines workers=1 will be much slower than the default (workers=3); on a machine with 8 or more cores, up to workers=8 is often a good idea. (Though, unless using the newer corpus_file input option, more workers up to the full count of 16, 32, etc cores doesn't help.)
classic Doc2Vec usage doesn't assign docs just known labels (as in your "10 string type of tags"), but unique IDs for each document. In some cases using, or adding, known labels as tags may help, but beware that if you're only supplying 10 tags, you've essentially turned your 10,000 documents into 10 documents (from the perspective of the model's view, which sees all texts with the same tag as if they were segments of one larger document with that tag). In plain PV-DBOW, training only 10 doc-vectors, of 100-dimensions each, from just 10 distinct examples wouldn't make much sense: it'd be prone to severe overfitting. (In PV-DM or PV-DBOW with dbow_words, the fact that the model is training both 10 doc-vectors and many hundreds/thousands of other vocabulary-word word-vectors would help offset the risk of overfitting.)
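The last note, about giving every document its own ID and optionally adding a known label as a second tag, could look like the sketch below. `tag_corpus` and the `doc_` prefix are names invented for this example; the fallback namedtuple only mirrors the two fields of gensim's TaggedDocument so the snippet also runs where gensim is not installed.

```python
from typing import List, Optional, Sequence

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, so the sketch runs without gensim.
    from collections import namedtuple

    TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def tag_corpus(
    token_lists: Sequence[List[str]], labels: Optional[Sequence[str]] = None
) -> list:
    """Give every document a unique ID tag, plus its known label if provided."""
    tagged = []
    for i, tokens in enumerate(token_lists):
        tags = ["doc_%d" % i]  # unique per document
        if labels is not None:
            tags.append(labels[i])  # shared by all documents with this label
        tagged.append(TaggedDocument(words=tokens, tags=tags))
    return tagged
```

With 10,000 documents and 10 labels this yields 10,010 doc-vectors rather than 10, so each text still gets its own vector. Whether the extra label tags help is something to check against your own evaluation.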

What is an appropriate training set size for sentiment analysis?

I'm looking to use some tweets about measles/the MMR vaccine to see how sentiment about vaccination changes over time. I plan on creating the training set from the corpus of data I currently have (unless someone has a recommendation on where I can get similar data).
I would like to classify a tweet as either: Pro-vaccine, Anti-Vaccine, or Neither (these would be factual tweets about outbreaks).
So the question is: How big is big enough? I want to avoid overfitting (so I'll do a train/test split), but as I include more and more tweets, the number of features that need to be learned increases dramatically.
I was thinking 1000 tweets (333 of each). Any input is appreciated here, and if you could recommend some resources, that would be great too.
More is always better. 1000 tweets across a 3-way split seems quite ambitious; I would consider even 1000 per class quite low for a 3-way split on tweets. Label as many as you can within a feasible amount of time.
Also, it might be worth taking a cascaded approach (especially with so little data): first label a set as vaccine vs non-vaccine, and then, within the vaccine subset, label pro vs anti.
In my experience, trying to model a catch-all "neutral" class that contains everything that is not explicitly "pro" or "anti" is quite difficult because there is so much noise. Especially with simpler models such as Naive Bayes, I have found the cascaded approach to work quite well.
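The cascaded approach can be sketched as two binary decisions chained together. The two stage functions below are toy keyword rules standing in for trained classifiers (for example, two Naive Bayes models, each trained on its own labelled subset); the function names and keywords are invented for illustration.

```python
from typing import Callable


def cascade_predict(
    text: str,
    is_vaccine_tweet: Callable[[str], bool],
    is_pro_vaccine: Callable[[str], bool],
) -> str:
    """Stage 1 filters vaccine vs non-vaccine; stage 2 only sees the vaccine subset."""
    if not is_vaccine_tweet(text):
        return "neither"
    return "pro" if is_pro_vaccine(text) else "anti"


# Toy placeholder rules; real stages would be classifiers trained on labelled tweets.
def toy_stage_1(text: str) -> bool:
    return "vaccin" in text.lower()


def toy_stage_2(text: str) -> bool:
    return "safe" in text.lower()


print(cascade_predict("Measles outbreak reported downtown", toy_stage_1, toy_stage_2))  # neither
```

The second stage never has to model the noisy catch-all class, which is the point made above about why the cascade tends to be easier to train.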

Resources