I want to create a prediction function that completes part of a sentence. The model used here is a character-based RNN (LSTM). What are the steps we should follow?
I tried this, but I can't pass the sentence as input:
def generate(self) -> Tuple[List[Token], torch.Tensor]:
    start_symbol_idx = self.vocab.get_token_index(START_SYMBOL, 'tokens')
    end_symbol_idx = self.vocab.get_token_index(END_SYMBOL, 'tokens')
    padding_symbol_idx = self.vocab.get_token_index(DEFAULT_PADDING_TOKEN, 'tokens')

    log_likelihood = 0.
    words = []
    state = (torch.zeros(1, 1, self.hidden_size), torch.zeros(1, 1, self.hidden_size))

    word_idx = start_symbol_idx
    for i in range(self.max_len):
        tokens = torch.tensor([[word_idx]])

        embeddings = self.embedder({'tokens': tokens})
        output, state = self.rnn._module(embeddings, state)
        output = self.hidden2out(output)

        log_prob = torch.log_softmax(output[0, 0], dim=0)
        dist = torch.exp(log_prob)

        # resample until we get something other than the start or padding symbol
        word_idx = start_symbol_idx
        while word_idx in {start_symbol_idx, padding_symbol_idx}:
            word_idx = torch.multinomial(
                dist, num_samples=1, replacement=False).item()

        log_likelihood += log_prob[word_idx]

        if word_idx == end_symbol_idx:
            break

        token = Token(text=self.vocab.get_token_from_index(word_idx, 'tokens'))
        words.append(token)

    return words, log_likelihood
Here are two tutorials on how to use machine-learning libraries to generate text, one for TensorFlow and one for PyTorch.
This code snippet is part of the AllenNLP "language model" tutorial. The generate function there computes the probability of tokens and finds the best token, and the best sequence of tokens, according to the maximum likelihood of the model output. The full code is in the Colab notebook below; you can refer to it: https://colab.research.google.com/github/mhagiwara/realworldnlp/blob/master/examples/generation/lm.ipynb#scrollTo=8AU8pwOWgKxE
After training the language model, you can use this function like so:
for _ in range(50):
    tokens, _ = model.generate()
    print(''.join(token.text for token in tokens))
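To complete a given sentence rather than sample from scratch, one option is to first feed the prefix tokens through the network to build up the LSTM state and only then start sampling. Below is a minimal sketch of such a variant, assuming the same attributes as the notebook's model (self.vocab, self.embedder, self.rnn, self.hidden2out, self.hidden_size); the generate_completion name and its prefix_tokens argument are illustrative, not part of the original tutorial:

def generate_completion(self, prefix_tokens: List[str], max_len: int = 50) -> List[Token]:
    # Sketch only: prime the LSTM with the prefix, then sample the continuation.
    start_symbol_idx = self.vocab.get_token_index(START_SYMBOL, 'tokens')
    end_symbol_idx = self.vocab.get_token_index(END_SYMBOL, 'tokens')
    padding_symbol_idx = self.vocab.get_token_index(DEFAULT_PADDING_TOKEN, 'tokens')

    state = (torch.zeros(1, 1, self.hidden_size), torch.zeros(1, 1, self.hidden_size))

    # 1) Feed the prefix through the network to build up the hidden state.
    prefix_ids = [start_symbol_idx] + [
        self.vocab.get_token_index(t, 'tokens') for t in prefix_tokens]
    for idx in prefix_ids:
        embeddings = self.embedder({'tokens': torch.tensor([[idx]])})
        output, state = self.rnn._module(embeddings, state)

    # 2) Sample the continuation, starting from the distribution produced by
    #    the last prefix token (same sampling loop as generate()).
    words = [Token(text=t) for t in prefix_tokens]
    for _ in range(max_len):
        logits = self.hidden2out(output)
        dist = torch.softmax(logits[0, 0], dim=0)

        word_idx = start_symbol_idx
        while word_idx in {start_symbol_idx, padding_symbol_idx}:
            word_idx = torch.multinomial(dist, num_samples=1).item()
        if word_idx == end_symbol_idx:
            break

        words.append(Token(text=self.vocab.get_token_from_index(word_idx, 'tokens')))
        embeddings = self.embedder({'tokens': torch.tensor([[word_idx]])})
        output, state = self.rnn._module(embeddings, state)

    return words

For a character-level model you would pass the prefix as a list of characters, e.g. model.generate_completion(list('the wea')).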
I am trying to train SRGAN from scratch. I have read solutions for this type of problem, but it would be great if someone could help me debug my code. The exact error is: "RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad()" Here is the snippet I am trying to train:
gen_model = Generator().to(device, non_blocking=True)
disc_model = Discriminator().to(device, non_blocking=True)
opt_gen = optim.Adam(gen_model.parameters(), lr=0.01)
opt_disc = optim.Adam(disc_model.parameters(), lr=0.01)
from torch.nn.modules.loss import BCELoss

def train_model(gen, disc):
    for epoch in range(20):
        run_loss_disc = 0
        run_loss_gen = 0
        for data in train:
            low_res, high_res = data[0].to(device, non_blocking=True, dtype=torch.float).permute(0, 3, 1, 2), data[1].to(device, non_blocking=True, dtype=torch.float).permute(0, 3, 1, 2)
            #--------Discriminator-----------------
            gen_image = gen(low_res)
            gen_image = gen_image.detach()
            disc_gen = disc(gen_image)
            disc_real = disc(high_res)
            p = nn.BCEWithLogitsLoss()
            loss_gen = p(disc_real, torch.ones_like(disc_real))
            loss_real = p(disc_gen, torch.zeros_like(disc_gen))
            loss_disc = loss_gen + loss_real
            opt_disc.zero_grad()
            loss_disc.backward()
            run_loss_disc += loss_disc
            #---------Generator--------------------
            cont_loss = vgg_loss(high_res, gen_image)
            adv_loss = 1e-3*p(disc_gen, torch.ones_like(disc_gen))
            gen_loss = cont_loss+(10^-3)*adv_loss
            opt_gen.zero_grad()
            gen_loss.backward()
            opt_disc.step()
            opt_gen.step()
            run_loss_gen += gen_loss
        print("Run Loss Discriminator: %d", run_loss_disc)
        print("Run Loss Generator: %d", run_loss_gen)

train_model(gen_model, disc_model)
As the error says, the computation graph behind disc_gen was freed by the first backward() call, so the generator loss cannot backpropagate through it again.
It should work if you change the discriminator part a bit:
gen_image = gen(low_res)
disc_gen = disc(gen_image.detach())
and add this at the start of the generator part:
disc_gen = disc(gen_image)
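Put together, the inner training loop would look roughly like this; a sketch reusing the names from the question (vgg_loss and device are assumed to be defined as before, and loader stands in for the question's train iterable):

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

for low_res, high_res in loader:                 # 'loader' stands in for the question's 'train'
    low_res = low_res.to(device, dtype=torch.float).permute(0, 3, 1, 2)
    high_res = high_res.to(device, dtype=torch.float).permute(0, 3, 1, 2)

    # -------- Discriminator step --------
    gen_image = gen_model(low_res)
    disc_real = disc_model(high_res)
    disc_fake = disc_model(gen_image.detach())   # detached: no gradient flows back into the generator
    loss_disc = (criterion(disc_real, torch.ones_like(disc_real))
                 + criterion(disc_fake, torch.zeros_like(disc_fake)))
    opt_disc.zero_grad()
    loss_disc.backward()
    opt_disc.step()

    # -------- Generator step --------
    disc_gen = disc_model(gen_image)             # fresh forward pass; the graph to the generator is intact
    adv_loss = 1e-3 * criterion(disc_gen, torch.ones_like(disc_gen))
    gen_loss = vgg_loss(high_res, gen_image) + adv_loss
    opt_gen.zero_grad()
    gen_loss.backward()
    opt_gen.step()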
I am using the Hugging Face DistilBERT model as a backend for a question-and-answer application. The text I am feeding the model is one very large single text field. Even though the text field is a single string, the punctuation was left in place as a clue for BERT. When I execute the application I get the "Token indices sequence length" error. I am using the tokenizer.encode_plus() method to pass the text into the model. I have tried various mechanisms to truncate the input ids to a length <= 512.
I am currently using Windows 10 but I will also be porting the code to a Raspberry Pi 4 platform.
The code is failing at this line:
start_scores, end_scores = model(torch.tensor([input_ids]), attention_mask=torch.tensor([attention_mask]))
I am attempting to perform the truncation at this line:
encoding = tokenizer.encode_plus(question, tokenizer(context, truncation=True).input_ids)
The entire code is here:
from transformers import AutoTokenizer, DistilBertTokenizer, DistilBertForQuestionAnswering
import torch

# globals - set once used everywhere
tokenizer = None
model = None
context = ''

def establishSettings():
    global tokenizer, model, context
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', return_token_type_ids=True, model_max_length=512)
    model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad', return_dict=False)
    # context = "Some 1,500 volcanoes are still considered potentially active around the world today 161 of those over 10 percent sit within the boundaries of the United States."

    # get the volcano corpus
    with open('volcanic.corpus', encoding="utf8") as file:
        context = file.read().replace('\n', '')

    print(len(tokenizer(context, truncation=True).input_ids))

def askQuestion(question):
    global tokenizer, model, context
    print("\nQuestion ", question)
    encoding = tokenizer.encode_plus(question, tokenizer(context, truncation=True).input_ids)
    input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]
    start_scores, end_scores = model(torch.tensor([input_ids]), attention_mask=torch.tensor([attention_mask]))
    ans_tokens = input_ids[torch.argmax(start_scores): torch.argmax(end_scores) + 1]
    answer_tokens = tokenizer.convert_ids_to_tokens(ans_tokens, skip_special_tokens=True)
    #all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    return answer_tokens

def main():
    # set the global items once
    establishSettings()

    # ask a question
    question = "How many potentially active volcanoes are there in the world today?"
    answer_tokens = askQuestion(question)
    print("answer_tokens: ", answer_tokens)

    if len(answer_tokens) == 0:
        answer = "Sorry, I don't have an answer for that one. Ask me another question about New Mexico volcanoes."
        print(answer)
    else:
        answer_tokens_to_string = tokenizer.convert_tokens_to_string(answer_tokens)
        print("\nFinal Answer : ")
        print(answer_tokens_to_string)

if __name__ == '__main__':
    main()
What is the best way to truncate input_ids to a length <= 512?
Edit this line:
encoding = tokenizer.encode_plus(question, tokenizer(context, truncation=True).input_ids)
to
encoding = tokenizer.encode_plus(question, tokenizer(context, truncation=True, max_length=512).input_ids)
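If you prefer, you can also let the tokenizer handle the question and the raw context text directly; this variant is a sketch rather than part of the original answer, and it truncates only the context so the question is always kept intact:

encoding = tokenizer.encode_plus(
    question,
    context,
    truncation="only_second",   # truncate the context, never the question
    max_length=512,
)
input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]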
I am using the Hugging Face transformers library to find whether a sentence is well-formed or not. I am using the masked language model XLM-R. I first tokenize my sentence, then mask each word of the sentence one by one, then run the masked sentences through the model and find the probability that the predicted masked word is right.
import copy
import math

import torch
from tqdm import tqdm

def calculate_scores(sent, model, tokenizer, device, print_pred=False, maskval=False):
    k = 0
    dic = {}
    ls = tokenizer.batch_encode_plus(sent)
    input_list = ls.input_ids
    h = 0
    with torch.no_grad():
        for i in tqdm(range(len(input_list))):
            item = input_list[i]
            real_input = item
            attmask = [1] * len(item)
            seg = [0] * len(item)
            seglist = [seg]
            masked_list = [real_input]
            attlist = [attmask]
            # build one copy of the sentence per position, masking exactly one token each time
            for j in range(1, len(item) - 1):
                input = copy.deepcopy(real_input)
                input[j] = 50264
                masked_list.append(input)
                attlist.append(attmask)
                seglist.append(seg)
            inid = torch.tensor(masked_list)
            segtensor = torch.tensor(seglist)
            atttensor = torch.tensor(attlist)
            inid = inid.to(device)
            segtensor = segtensor.to(device)
            output = model(inid, segtensor)
            predictions_logits = output.logits
            predictions = torch.softmax(predictions_logits, dim=2)
            ppscore = 0
            # row j holds the copy of the sentence whose j-th token was masked
            for j in range(1, len(item) - 1):
                ppscore = ppscore + math.log(predictions[j, j, item[j]], 2)
            try:
                score = math.pow(2, (-1 / (len(item) - 2)) * ppscore)
                dic[sent[i]] = score
            except:
                print(sent[i])
                dic[sent[i]] = 10000000
    return dic
I will explain my code quickly. The function calculate_scores takes sent, a list of sentences, as input. I first batch-encode this list of sentences. Then, for each encoded sentence, I generate masked copies in which exactly one token is masked and the rest are left unmasked. I feed these generated sentences to the model, get the output probabilities, and from those compute the perplexity.
But this is not a good way of utilizing the GPU. I want to process multiple sentences at once, but at the same time I also need the perplexity score for each individual sentence. How would I go about doing this?
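One possible direction, shown here only as a rough sketch: build the masked copies for all sentences up front, remember which sentence and position each row came from, and push them through the model in fixed-size batches. It assumes a standard Hugging Face masked LM and its tokenizer; the helper name batched_pseudo_perplexity and its batch_size parameter are illustrative:

import math
import torch

def batched_pseudo_perplexity(sents, model, tokenizer, device, batch_size=64):
    # Hedged sketch: pool every masked copy of every sentence into fixed-size
    # batches, then fold the per-position scores back into per-sentence values.
    enc = tokenizer(sents, padding=True, return_tensors="pt")
    input_ids, attn = enc["input_ids"], enc["attention_mask"]
    mask_id = tokenizer.mask_token_id

    rows, meta = [], []                       # one row per masked copy; meta = (sentence, position)
    for s in range(input_ids.size(0)):
        length = int(attn[s].sum())           # real tokens, including the special tokens
        for pos in range(1, length - 1):      # skip <s> and </s>
            masked = input_ids[s].clone()
            masked[pos] = mask_id
            rows.append(masked)
            meta.append((s, pos))
    rows = torch.stack(rows)

    per_sentence = {s: [] for s in range(len(sents))}
    model.eval()
    with torch.no_grad():
        for start in range(0, len(rows), batch_size):
            batch_meta = meta[start:start + batch_size]
            chunk = rows[start:start + batch_size].to(device)
            chunk_attn = attn[[s for s, _ in batch_meta]].to(device)
            logits = model(chunk, attention_mask=chunk_attn).logits
            log_probs = torch.log_softmax(logits, dim=-1)
            for i, (s, pos) in enumerate(batch_meta):
                per_sentence[s].append(log_probs[i, pos, input_ids[s, pos]].item())

    # Pseudo-perplexity: exp of the negative mean token log-probability.
    return {sents[s]: math.exp(-sum(v) / len(v)) for s, v in per_sentence.items() if v}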
I am trying to implement my own MT engine; I am following the steps in https://github.com/bentrevett/pytorch-seq2seq/blob/master/1%20-%20Sequence%20to%20Sequence%20Learning%20with%20Neural%20Networks.ipynb
SRC = Field(tokenize=tokenize_en,
            init_token='<sos>',
            eos_token='<eos>',
            lower=True)

TRG = Field(tokenize=tokenize_de,
            init_token='<sos>',
            eos_token='<eos>',
            lower=True)
After training the model, the notebook only shows how to evaluate in batches, but I want to try a single string and get the translation result. For example, I want my model to translate the input "Boys" and get the German translation.
savedfilemodelpath = './pretrained_model/2020-09-27en-de.pth'
model.load_state_dict(torch.load(savedfilemodelpath))
model.eval()

inputstring = 'Boys'
processed = SRC.process([SRC.preprocess(inputstring)]).to(device)
output = model(processed, processed)
output_dim = output.shape[-1]
outputs = output[1:].view(-1, output_dim)

for item in outputs:
    print('item shape is {} and item.argmax is {}, and words is {}'.format(item.shape, item.argmax(), TRG.vocab.itos[item.argmax()]))
So my question is: is it right to get the translation results this way?
First: convert the string to a tensor.
inputstring = 'Boys'
processed=SRC.process([SRC.preprocess(inputstring)]).to(device)
Second: send the tensor to the model. Since the model takes a TRG argument, I have to pass some tensor there; is it possible not to pass the TRG tensor?
output=model(processed,processed)
output_dim = output.shape[-1]
outputs = output[1:].view(-1, output_dim)
Third: on the returned tensor, I use argmax to get the translation results. Is that right?
Or how can I get the correct translation results?
for item in outputs:
    print('item shape is {} and item.argmax is {}, and words is {}'.format(item.shape, item.argmax(), TRG.vocab.itos[item.argmax() + 1]))
I got the answer from the translate_sentence function. Really, thanks @Aladdin Persson.
def translate_sentence(model, sentence, SRC, TRG, device, max_length=50):
    # Create tokens using spaCy, everything lower-cased (which is what our vocab is)
    if type(sentence) == str:
        tokens = [token.text.lower() for token in spacy_en(sentence)]
    else:
        tokens = [token.lower() for token in sentence]

    # Add <SOS> and <EOS> at the beginning and end respectively
    tokens.insert(0, SRC.init_token)
    tokens.append(SRC.eos_token)

    # Go through each English token and convert it to an index
    text_to_indices = [SRC.vocab.stoi[token] for token in tokens]

    # Convert to tensor
    sentence_tensor = torch.LongTensor(text_to_indices).unsqueeze(1).to(device)

    # Build encoder hidden and cell state
    with torch.no_grad():
        hidden, cell = model.encoder(sentence_tensor)

    outputs = [TRG.vocab.stoi["<sos>"]]

    for _ in range(max_length):
        previous_word = torch.LongTensor([outputs[-1]]).to(device)

        with torch.no_grad():
            output, hidden, cell = model.decoder(previous_word, hidden, cell)

        best_guess = output.argmax(1).item()
        outputs.append(best_guess)

        # Model predicts it's the end of the sentence
        if output.argmax(1).item() == TRG.vocab.stoi["<eos>"]:
            break

    translated_sentence = [TRG.vocab.itos[idx] for idx in outputs]
    # remove start token
    return translated_sentence[1:]
And the translation is not generated all at once. Actually, it generates one token at a time and calls the decoder several times.
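For completeness, a short usage sketch with the same objects as in the question (model, SRC, TRG, device, and the saved checkpoint):

model.load_state_dict(torch.load(savedfilemodelpath))
model.eval()
translation = translate_sentence(model, 'Boys', SRC, TRG, device, max_length=50)
print(' '.join(translation))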
I'm trying to build an NMF model for topic extraction. For re-training the model, I have to pass a parameter to the NMF function, and for that I need the x coordinate (the first element) of a pair that the code returns. Here is the code for reference:
no_features = 1000
no_topics = 9
print ('Old number of topics: ', no_topics)
tfidf_vectorizer = TfidfVectorizer(max_df = 0.95, min_df = 2, max_features = no_features, stop_words = 'english')
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
no_topics = tfidf.shape
print('New number of topics :', no_topics)
# nmf = NMF(n_components = no_topics, random_state = 1, alpha = .1, l1_ratio = .5, init = 'nndsvd').fit(tfidf)
On the third-to-last line, tfidf.shape assigns the pair (3, 1000) to the variable no_topics; however, I want that variable to be set to only the x coordinate, i.e. 3.
How can I extract just the x coordinate from the pair?
You can select the first value with no_topics[0]:
print('New number of topics : {}'.format(no_topics[0]))
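For example, using the same variable names as the question (n_docs and n_features are just illustrative names):

no_topics = tfidf.shape[0]        # first element of (3, 1000) -> 3
# or unpack both values at once:
n_docs, n_features = tfidf.shape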
You can also slice your tfidf matrix with:
topics = tfidf[0,:]