query() of generator: `max_length` being exceeded - python-3.x

Goal: set min_length and max_length in a Hugging Face Transformers generator query.
I've passed 50 and 200 for these parameters, yet my outputs are much longer than that.
There's no runtime failure.
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
def query(payload, multiple, min_char_len, max_char_len):
    print(min_char_len, max_char_len)
    list_dict = generator(payload, min_length=min_char_len, max_length=max_char_len, num_return_sequences=multiple)
    test = [d['generated_text'].split(payload)[1].strip() for d in list_dict]
    for t in test: print(len(t))
    return test
query('example', 1, 50, 200)
Output:
50 200
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
1015

Explanation:
As explained by Narsil in a response on a Hugging Face 🤗 Transformers Git issue:
Models don't ingest the text one character at a time, but one token
at a time. There are different algorithms to achieve this but
basically "My name is Nicolas" gets transformed into ["my", " name",
" is", " nic", "olas"] for instance, and each of those tokens has a
number.
So when you are generating tokens, they can themselves contain 1 or
more characters (usually several, and almost any common word for
instance). That's why you are seeing 1015 instead of your expected 200
(the tokens here have an average of 5 chars)
Solution:
As I resolved...
Rename min_char_len, max_char_len to min_tokens, max_tokens and
simply reduce their values to roughly 1/4 or 1/5 of the character counts.
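To make that conversion concrete, here is a minimal sketch. The helper name `chars_to_token_budget` and the default of 5 characters per token are illustrative assumptions (the 5 comes from the average quoted above); the real ratio varies with the text and the tokenizer. Note also that in Transformers `max_length` counts the prompt tokens as well as the generated ones.

```python
def chars_to_token_budget(n_chars, chars_per_token=5.0):
    """Convert a character budget into an approximate token budget.

    chars_per_token is an assumed average, not a property of any tokenizer.
    """
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    # Never return less than one token, so the result is usable as a length.
    return max(1, round(n_chars / chars_per_token))


min_tokens = chars_to_token_budget(50)
max_tokens = chars_to_token_budget(200)
print(min_tokens, max_tokens)
```

The two results can then be passed as `min_length` and `max_length` in the `generator(...)` call from the question.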

Related

How do I know how many tokens a GPT-3 request used?

I am building an app around GPT-3, and I would like to know how many tokens every request I make uses. Is this possible, and if so, how?
Counting Tokens with Actual Tokenizer
To do this in Python, first install the transformers package to enable the GPT-2 tokenizer, which is the same tokenizer used for GPT-3:
pip install transformers
Then, to tokenize the string "Hello world", you have a choice of using GPT2TokenizerFast or GPT2Tokenizer.
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
number_of_tokens = len(tokenizer("Hello world")['input_ids'])
or
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
number_of_tokens = len(tokenizer("Hello world")['input_ids'])
In either case, tokenizer() produces a Python list of tokens representing the string, which can then be counted with len(). The documentation doesn't mention any differences in behavior between the two methods. I tested both methods on both text and code and they gave the same numbers. The from_pretrained methods are unpleasantly slow: 28s for GPT2Tokenizer, and 56s for GPT2TokenizerFast. The load time dominates the experience, so I suggest NOT using the "fast" method. (Note: the first time you run either of the from_pretrained methods, a 3MB model will be downloaded and installed, which takes a couple of minutes.)
Approximating Token Counts
The tokenizers are slow and heavy, but approximations can be used to go back and forth between character counts and token counts, using nothing but the number of characters or tokens. I developed the following approximations by observing the behavior of the GPT-2 tokenizer. They hold well for English text and Python code. The 3rd and 4th functions are perhaps the most useful, since they let us quickly fit a text within GPT-3's token limit.
import math
#assumed value: init_offset is not defined in the original snippet; 2 mirrors the "+ 2" in ntokens_to_nchars_approx
init_offset = 2
def nchars_to_ntokens_approx(nchars):
    #returns an estimate of #tokens corresponding to #characters nchars
    return max(0,int((nchars - init_offset)*math.exp(-1)))
def ntokens_to_nchars_approx(ntokens):
    #returns an estimate of #characters corresponding to #tokens ntokens
    return max(0,int(ntokens*math.exp(1) ) + 2 )
def nchars_leq_ntokens_approx(maxTokens):
    #returns a number of characters very likely to correspond <= maxTokens
    sqrt_margin = 0.5
    lin_margin = 1.010175047 #= e - 1.001 - sqrt(1 - sqrt_margin) #ensures return 1 when maxTokens=1
    return max( 0, int(maxTokens*math.exp(1) - lin_margin - math.sqrt(max(0,maxTokens - sqrt_margin) ) ))
def truncate_text_to_maxTokens_approx(text, maxTokens):
    #returns a truncation of text to make it (likely) fit within a token limit
    #So the output string is very likely to have <= maxTokens, no guarantees though.
    char_index = min( len(text), nchars_leq_ntokens_approx(maxTokens) )
    return text[:char_index]
OpenAI charges for GPT-3 usage by tokens, and this counts both the prompt and the answer. According to OpenAI, 750 words are equivalent to around 1000 tokens, a token-to-word ratio of about 1.33. The price per token depends on the plan you are on.
I do not know of more accurate ways of estimating cost. Perhaps using the GPT-2 tokenizer from Hugging Face can help. I know the tokens from the GPT-2 tokenizer are accepted when passed to GPT-3 in the logit bias array, so there is a degree of equivalence between GPT-2 tokens and GPT-3 tokens.
However, the GPT-2 and GPT-3 models are different, and GPT-3 famously has more parameters than GPT-2, so GPT-2 estimations are probably lower token-wise. I am sure you can write a simple program that estimates the price by comparing prompts and token usage, but that might take some time.
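The word-based rule of thumb in this answer (about 1000 tokens per 750 words) can be sketched as below. The function name `estimate_tokens_from_words` and the whitespace-based word split are assumptions for illustration; this is only a rough estimate and not a substitute for running a real tokenizer.

```python
def estimate_tokens_from_words(text, tokens_per_750_words=1000):
    """Rough token estimate from a whitespace word count.

    Uses the ~1000 tokens per 750 words rule of thumb and rounds up,
    so the estimate errs on the high side.
    """
    n_words = len(text.split())
    # Integer ceiling division avoids floating-point rounding surprises.
    return (n_words * tokens_per_750_words + 749) // 750


print(estimate_tokens_from_words("Hello world"))
```

Because prompt and answer are both billed, the same estimate would have to be applied to the prompt text and to the expected length of the completion.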
Here is an example from openai-cookbook that worked perfectly for me:
import tiktoken
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens
num_tokens_from_string("tiktoken is great!", "gpt2")
>6
Code to count how many tokens a GPT-3 request used:
from transformers import GPT2TokenizerFast
def count_tokens(input: str):
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    res = tokenizer(input)['input_ids']
    return len(res)
print(count_tokens("Hello world"))

Is this text training with skip-gram correct?

I am still a beginner with neural networks and NLP.
In this code I'm training cleaned text (some tweets) with skip-gram.
But I do not know if I do it correctly.
Can anyone inform me about the correctness of this skip-gram text training?
Any help is appreciated.
This is my code:
import numpy as np
import pandas as pd
from nltk import word_tokenize
from gensim.models.phrases import Phrases, Phraser
sent = [row.split() for row in X['clean_text']]
phrases = Phrases(sent, max_vocab_size = 50, progress_per=10000)
bigram = Phraser(phrases)
sentences = bigram[sent]
from gensim.models import Word2Vec
w2v_model = Word2Vec(window=5,
                     size = 300,
                     sg=1)
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=25)
del sentences #to reduce memory usage
def get_mat(model, corpus, size):
    vecs = np.zeros((len(corpus), size))
    n = 0
    for i in corpus.index:
        vecs[i] = np.zeros(size).reshape((1, size))
        for word in str(corpus.iloc[i,0]).split():
            try:
                vecs[i] += model[word]
                #n += 1
            except KeyError:
                continue
    return vecs
X_sg = get_mat(w2v_model, X, 300)
del X
X_sg=pd.DataFrame(X_sg)
X_sg.head()
from sklearn import preprocessing
scale = preprocessing.normalize
X_sg=scale(X_sg)
for i in range(len(X_sg)):
    X_sg[i]+=1 #I did this because some weights were negative! So could not
               #apply LSTM on them later
You haven't mentioned if you've received any errors, or unsatisfactory results, so it's hard to know what kind of help you might need.
Your specific lines of code involving the Word2Vec model are roughly correct: plausibly-useful parameters (if you have a dataset large enough to train 300-dimensional vectors), and the proper steps. So the real proof would be whether your results are acceptable.
Regarding your attempted use of Phrases bigram-creation beforehand:
You should get things generally working and with promising results before adding this extra pre-processing complexity.
The parameter max_vocab_size=50 is seriously misguided and may make the phrases-step pointless. The max_vocab_size is a hard cap on how many words/bigrams are tallied by the class, as a way to cap its memory-usage. (Whenever the number of known words/bigrams hits this cap, many lower-frequency words/bigrams are pruned – in practice, a majority of all words/bigrams each pruning, giving up a lot of accuracy in return for capped memory usage.) The max_vocab_size default in gensim is 40,000,000 – but the default in the Google word2phrase.c source on which gensim's method is based was 500,000,000. By using just 50, it's not really going to learn anything useful about just whatever 50 words/bigrams survive the many prunings.
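To illustrate why such a tiny cap loses most of the tallies, here is a toy model of capped counting with pruning. It is a simplified sketch, not gensim's actual implementation: the function name `capped_count` and the exact pruning rule (drop everything seen at most `min_reduce` times, then raise the threshold) are assumptions made for the example.

```python
from collections import Counter


def capped_count(tokens, max_vocab_size):
    """Count tokens, pruning rare entries whenever the table outgrows the cap."""
    counts = Counter()
    min_reduce = 1  # pruning threshold, raised after every pruning pass
    for token in tokens:
        counts[token] += 1
        if len(counts) > max_vocab_size:
            # Drop every entry seen no more than min_reduce times so far.
            for key in [k for k, c in counts.items() if c <= min_reduce]:
                del counts[key]
            min_reduce += 1
    return counts


tokens = ["a", "b", "a", "c", "a", "d"]
print(capped_count(tokens, max_vocab_size=2))    # tallies for "b" and "c" are lost
print(capped_count(tokens, max_vocab_size=100))  # cap never reached, nothing pruned
```

With a cap as small as 50 on a real tweet corpus, pruning of this kind would fire almost continuously, so very few word or bigram counts would survive long enough to be useful.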
Regarding your get_mat() function & later DataFrame code, I have no idea what you're trying to do with it, so can't offer any opinion on it.

Some diverging issues of Word2Vec in Gensim using high alpha values

I am implementing word2vec in gensim on a corpus of nested lists (a list of sentences, each a list of tokenized words) with 408226 sentences (lists) and a total of 3150546 words or tokens.
I am getting meaningful results (in terms of the similarity between two words using model.wv.similarity) with size set to 200, window to 15, min_count to 5, iter to 10 and alpha to 0.5. All words are lemmatized, and the resulting model vocabulary has 32716 entries.
The results obtained with the default alpha value, size and window are meaningless to me, judging by the similarity values computed on this data. A higher alpha of 0.5, however, gives me some meaningful similarity scores between two words. Yet when I calculate the top n similar words, the results are again meaningless. Do I need to change all of the parameters used in the initial training process?
I am still unable to find the exact reason why the model behaves well with such a high alpha value when computing the similarity between two words of the corpus, whereas the results are meaningless when computing the top n similar words with scores for an input word. Why is this the case?
Is it diverging away from the optimal solution? How can I check this?
Any idea why this is the case is deeply appreciated.
Note: I'm using Python 3.7 on Windows machine with anaconda prompt and giving input to the model from a file.
This is what I have tried.
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.models import Word2Vec
import ast
path = "F:/Folder/"
def load_data():
    global Sentences
    Sentences = []
    for file in ['data_d1.txt','data_d2.txt']:
        with open(path + file, 'r', encoding = 'utf-8') as f1:
            Sentences.extend(ast.literal_eval(*f1.readlines()))
load_data()
def initialize_word_embedding():
    model = Word2Vec(Sentences, size = 200, window = 15, min_count = 5, iter = 10, workers = 4)
    print(model)
    print(len(model.wv.vocab))
    print(model.wv.similarity(w1 = 'structure', w2 = '_structure_'))
    similarities = model.wv.most_similar('system')
    for word, score in similarities:
        print(word , score)
initialize_word_embedding()
An example of the Sentences list is as follows:
[['scientist', 'time', 'comet', 'activity', 'sublimation', 'carbon', 'dioxide', 'nears', 'ice', 'system'], ['inconsistent', 'age', 'system', 'year', 'size', 'collision'], ['intelligence', 'system'], ['example', 'application', 'filter', 'image', 'motion', 'channel', 'estimation', 'equalization', 'example', 'application', 'filter', 'system']]
data_d1.txt and data_d2.txt each contain a nested list (a list of lists of lemmatized, tokenized words). I preprocessed the raw data, saved it to these files, and now give them as input. For lemmatizing the tokens, I used the popular WordNet lemmatizer.
I need the word-embedding model to calculate the similarity between two words and to compute the most_similar words of a given input word. I am getting some meaningful scores from the model.wv.similarity() method, whereas for the most_similar() words of a word (say, system, as shown above) I am not getting the desired results.
I am guessing the model is diverging from the global minimum because of the high alpha value.
I am confused about what the dimension size and window should be to get meaningful results, as there are no rules for choosing the size and window.
Any suggestion is appreciated. The total numbers of sentences and words are specified above in the question.
Results I am getting without setting alpha = 0.5
Edit in response to a recent comment:
Results:
Word2Vec(vocab=32716, size=200, alpha=0.025)
The similarity between set and _set_ is : 0.000269373188960656
This is meaningless to me, as the score is very low. However, I get 71% when setting alpha to 0.5, which seems meaningful to me, as the word set is the same in both domains.
Explanation: The word set should be the same for both domains (I am comparing the data of two domains using the same word). Don't be confused by the word _set_: it is the same word as set; I added the character _ at the start and end to distinguish the word's occurrences in the two different domains.
The top 10 words along with scores of _set_ are:
_niche_ 0.6891741752624512
_intermediate_ 0.6883598566055298
_interpretation_ 0.6813371181488037
_printer_ 0.675414502620697
_finer_ 0.6625382900238037
_pertinent_ 0.6620787382125854
_respective_ 0.6619025468826294
_converse_ 0.6610435247421265
_developed_ 0.659270167350769
_tent_ 0.6588765382766724
Whereas, the top 10 words for set are:
cardinality 0.633270263671875
typereduction 0.6233855485916138
zdzisław 0.619156002998352
crisp 0.6165326833724976
equivalenceclass 0.605925977230072
pawlak 0.6058803200721741
straight 0.6045454740524292
culik 0.6040038466453552
rin 0.6038737297058105
multisets 0.6035065650939941
Why is the cosine similarity value 0.00 for the word set across the two different datasets?
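For reference, model.wv.similarity reports the cosine similarity of the two word vectors. The plain-Python sketch below (the function name `cosine_similarity` and the toy vectors are illustrative, not taken from the model above) shows how that number is computed. A value near 0 means the two vectors are nearly orthogonal, i.e. training has not placed them in related directions; it does not mean they are opposites, which would give a value near -1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```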

How to check whether term is empty after pruning of tfidfvectorizer

I am using TfidfVectorizer to score terms from many different corpora.
Here is my code:
tfidf = TfidfVectorizer(ngram_range=(1,1), stop_words = 'english', min_df = 0.5)
for corpus in all_corpus:
    tfidf.fit_transform(corpus)
The number of documents varies between corpora, so when building the vocabulary, some corpora end up with no terms and raise an error:
after pruning, no terms remain. Try a lower min_df or higher max_df
I don't want to change the min or max DF. What I need is for the transforming process to be skipped when no terms remain. So I made a conditional filter like the one below:
for corpus in all_corpus:
    tfidf.fit_transform(corpus)
    if tfidf.shape[0] > 0:
        # execute some code here
However, the condition doesn't work. Is there a way to fix this?
All answers and comments are really appreciated. Thanks
First, I believe a minimal working example for your problem is the following:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(ngram_range=(1,1), stop_words = 'english', min_df = 0.5)
tfidf.fit_transform(['not I you'])
I could not reproduce an error with exactly the message you shared, but this gives me a ValueError, as all the words in my document are English stop words. (The code runs if one removes stop_words = 'english' from the snippet above.)
One way of handling the error inside a for-loop is to use a try/except block.
for corpus in all_corpus:
    try:
        tfidf.fit_transform(corpus)
    except ValueError:
        print('Transforming process skipped')
        # Here you can do more stuff
        continue # go to the beginning of the for-loop to start the next iteration
    # Here goes the rest of the code for corpus for which the transform functioned

Doc2Vec.infer_vector keeps giving a different result every time on a particular trained model

I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
I modified the code in line 10 to determine the best matching document for a given query, and every time I run it, I get a completely different result set. My new code in line 10 of the notebook is:
inferred_vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
rank = [docid for docid, sim in sims]
print(rank)
Every time I run this piece of code, I get a different set of documents matching the query "only you can prevent forest fires". The difference is stark, and the results just do not seem to match.
Is Doc2Vec not a suitable match for querying and information extraction? Or are there bugs?
Looking into the code, infer_vector uses parts of the algorithm that are non-deterministic. Initialization of the word vectors is deterministic (see the code of seeded_vector below), but later steps, i.e. random sampling of words and negative sampling (updating only a sample of the word vectors per iteration), can cause non-deterministic output (thanks #gojomo).
def seeded_vector(self, seed_string):
    """Create one 'random' vector (but deterministic by seed_string)"""
    # Note: built-in hash() may vary by Python version or even (in Py3.x) per launch
    once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
    return (once.rand(self.vector_size) - 0.5) / self.vector_size
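As a standalone illustration of the "deterministic by seed string" idea, the sketch below derives a vector from a string using only the standard library. It is not gensim's implementation: the function name `stable_seeded_vector` and the use of SHA-256 in place of `hashfxn` are assumptions, chosen so the example runs without gensim or NumPy and does not depend on the built-in hash(), which can change between launches.

```python
import hashlib
import random


def stable_seeded_vector(seed_string, vector_size):
    """Return a pseudo-random vector that depends only on seed_string."""
    # A cryptographic digest is stable across runs, unlike built-in hash().
    digest = hashlib.sha256(seed_string.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "big")  # keep 32 bits, like the mask above
    rng = random.Random(seed)                 # private generator, no global state
    # Same scaling as the excerpt: values centred on 0, shrunk by the vector size.
    return [(rng.random() - 0.5) / vector_size for _ in range(vector_size)]


first = stable_seeded_vector("only you can prevent forest fires", 8)
second = stable_seeded_vector("only you can prevent forest fires", 8)
print(first == second)
```

Identical seed strings always give identical starting vectors here, which is why any run-to-run differences have to come from the later sampling steps rather than from initialization.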
Set negative=0 to avoid randomization:
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
documents = [list('asdf'), list('asfasf')]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]
model = Doc2Vec(documents, vector_size=20, window=5, min_count=1, negative=0, workers=6, epochs=10)
a = list('test sample')
b = list('testtesttest')
for s in (a, b):
    v1 = model.infer_vector(s)
    for i in range(100):
        v2 = model.infer_vector(s)
        assert np.all(v1 == v2), "Failed on %s" % (''.join(s))

Resources