mT5 language translation - language in parameters - NLP

I'm trying to train an Arabic to English translation model, and I want to know if there is any option where I can specify the input and target language in the code.
I'm using the code below, where df1 is the dataframe with columns (prefix, input_text, target_text):
from simpletransformers.t5 import T5Model, T5Args

model_args = T5Args()  # training arguments

model = T5Model("mt5", "google/mt5-small", args=model_args, use_cuda=False)
model.train_model(df1, eval_data=df2)
results = model.eval_model(df2, verbose=True)
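For reference, here is a hypothetical sketch of how df1 could be prepared, with the translation direction spelled out in the prefix column (the rows are only illustrative):
import pandas as pd

# illustrative rows; the prefix string encodes the translation direction
df1 = pd.DataFrame({
    "prefix": ["translate Arabic to English", "translate Arabic to English"],
    "input_text": ["مرحبا بالعالم", "كيف حالك؟"],
    "target_text": ["Hello world", "How are you?"],
})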

Related

How do I do paraphrase generation using BERT/GPT-2?

I am trying hard to understand how to build paraphrase generation using BERT/GPT-2, but I cannot figure out how to do it. Could you please point me to any resources that would help me build a paraphrase generation model?
"The input would be a sentence and the output would be a paraphrase of the sentence"
Here is my recipe for training a paraphraser:
Instead of BERT (encoder only) or GPT (decoder only), use a seq2seq model with both an encoder and a decoder, such as T5, BART, or Pegasus. I suggest using mT5, the multilingual T5 model that was pretrained on 101 languages. If you want to load embeddings for your own language only (instead of using all 101), you can follow this recipe.
Find a corpus of paraphrases for your language and domain. For English, ParaNMT, PAWS, and QQP are good candidates. A corpus called Tapaco, extracted from Tatoeba, is a paraphrasing corpus that covers 73 languages, so it is a good starting point if you cannot find a paraphrase corpus for your language.
Fine-tune your model on this corpus. The code can be something like this:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# use here a backbone model of your choice, e.g. google/mt5-base
backbone_model = 'cointegrated/rut5-base-multitask'
model = T5ForConditionalGeneration.from_pretrained(backbone_model)
tokenizer = T5Tokenizer.from_pretrained(backbone_model)
model.cuda()
optimizer = torch.optim.Adam(params=[p for p in model.parameters() if p.requires_grad], lr=1e-5)

# todo: load the paraphrasing corpus and define the get_batch function
for i in range(100500):
    xx, yy = get_batch()  # lists of source sentences and their paraphrases
    x = tokenizer(xx, return_tensors='pt', padding=True).to(model.device)
    y = tokenizer(yy, return_tensors='pt', padding=True).to(model.device)
    # do not force the model to predict pad tokens
    y.input_ids[y.input_ids == 0] = -100
    loss = model(
        input_ids=x.input_ids,
        attention_mask=x.attention_mask,
        labels=y.input_ids,
        decoder_attention_mask=y.attention_mask,
        return_dict=True
    ).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained('my_paraphraser')
tokenizer.save_pretrained('my_paraphraser')
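The corpus loading and get_batch are left as a todo above; as a purely hypothetical sketch, assuming the corpus has already been read into a list of (source, paraphrase) string pairs, get_batch could be as simple as:
import random

# in practice, load thousands of pairs from your paraphrase corpus
pairs = [
    ("the cat sat on the mat", "a cat was sitting on the mat"),
    ("how are you today", "how are you doing today"),
]
batch_size = 2

def get_batch():
    # sample a random batch and split it into source and target lists
    sample = random.sample(pairs, batch_size)
    xx = [src for src, tgt in sample]
    yy = [tgt for src, tgt in sample]
    return xx, yy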
A more complete version of this code can be found in this notebook.
After the training, the model can be used in the following way:
from transformers import pipeline
pipe = pipeline(task='text2text-generation', model='my_paraphraser')
print(pipe('Here is your text'))
# [{'generated_text': 'Here is the paraphrase or your text.'}]
If you want your paraphrases to be more diverse, you can control the generation process using arguments such as:
print(pipe(
    'Here is your text',
    encoder_no_repeat_ngram_size=3,  # make the output different from the input
    do_sample=True,                  # randomize
    num_beams=5,                     # try more options
    max_length=128,                  # allow longer texts
))
Enjoy!
You can also use a T5 model fine-tuned for paraphrasing to generate paraphrases.

Customize spaCy stop words and save the model

I am using this to add stop words to spaCy's list of stop words:
nlp.Defaults.stop_words |= {"my_new_stopword1","my_new_stopword2",}
However, when I save the nlp object using nlp.to_disk() and load it back again with nlp.from_disk(), I lose the list of custom stop words.
Is there a way to save the custom stopwords with the nlp model?
Thanks in advance
Most language defaults (stop words, lexical attributes, and syntax iterators) are not saved with the model.
If you want to customize them, you can create a custom language class, see: https://spacy.io/usage/linguistic-features#language-subclass. An example copied from this link:
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

nlp1 = English()
nlp2 = CustomEnglish()

print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")])
print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")])
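Because the stop words live in the custom language class rather than in the serialized data, they survive a round trip through to_disk/from_disk as long as you reload into that class. A minimal sketch of the round trip (the directory name is just an example):
# the stop words themselves are not written to disk
nlp2.to_disk("custom_en_model")

# reloading into the custom class restores the custom stop words from its Defaults
nlp3 = CustomEnglish().from_disk("custom_en_model")
print([token.is_stop for token in nlp3("custom stop")])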

What would be the equivalent in the R language of this Python machine learning program?

As part of a school assignment on DSLs and code generation, I have to translate the following program, written in Python/scikit-learn, into the R language (the topic of the exercise is a hypothetical machine learning DSL).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
df = pd.read_csv('boston.csv', sep=',')
df.head()
y = df["medv"]
X = df.drop(columns=["medv"])
clf = DecisionTreeRegressor()
scoring = ['neg_mean_absolute_error','neg_mean_squared_error']
results = cross_validate(clf, X, y, cv=6,scoring=scoring)
print('mean_absolute_errors = '+str(results['test_neg_mean_absolute_error']))
print('mean_squared_errors = '+str(results['test_neg_mean_squared_error']))
Since I'm a complete newbie in machine learning, and especially in R, I can't do it.
Could someone help me?
Sorry for the late answer; you have probably already finished your school assignment. Of course we cannot just do it for you, and you will have to figure most of it out by yourself. I'm also not sure exactly what you need to do, but here are some tips:
Read a csv file
data <- read.csv(file="name_of_the_file", header=TRUE, sep=",")
data <- as.data.frame(data)
The header=TRUE indicates that the file has a header row with the column names, and sep="," is the same as in Python (the separator in the file is ',').
The as.data.frame makes sure that your data is kept in a dataframe format.
Add/delete a column
data <- data[, names(data) != "name_of_the_column_to_be_deleted"] # delete a column
data$name_of_column_to_be_added<- c(1:10) #add column
In order to add a column you will need to add the elements it will include. Also the # symbol indicates the beginning of a comment.
Modelling
For the modelling part I am not sure what you want to achieve, but R offers a huge selection of algorithms to choose from. For example, if you want to grow a decision tree, take a look at https://www.statmethods.net/advstats/cart.html, which uses the following script to grow a tree:
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start,
             method="class", data=kyphosis)

How to use Mozilla DeepSpeech to convert speech to text using its pre-trained model?

I want to convert speech to text using Mozilla DeepSpeech, but the output is really bad.
I have downloaded Mozilla's pre-trained model, and then what I have done is this:
from deepspeech import Model
import scipy.io.wavfile as wav

# model, alphabet, path and test are defined elsewhere in my script
BEAM_WIDTH = 500
LM_WEIGHT = 1.50
VALID_WORD_COUNT_WEIGHT = 2.10
N_FEATURES = 26
N_CONTEXT = 9

ds = Model(model, N_FEATURES, N_CONTEXT, alphabet, BEAM_WIDTH)

fs, audio = wav.read(path)
data = audio[:, 0]  # changing to mono channel (using only one channel)
prediction = ds.stt(data, fs)
print(test)
print(prediction)
Now the output is nowhere near my audio sample. What do I have to do to increase its accuracy?
I assume it's because you are not including any language model.
The pre-trained model is basically just the acoustic model, which will only transcribe the audio into similar-sounding text that may not make sense.
If you combine the acoustic model with a language model (LM), you will likely get better results.
In your code example I can see the parameter LM_WEIGHT but no reference to the LM itself.
I'm not sure which language you want to integrate DeepSpeech in, but here is the example for Node.js. This is the part where the LM is integrated:
const LM_ALPHA = 0.75;
const LM_BETA = 1.85;
let lmPath = './models/lm.binary';
let triePath = './models/trie';
model.enableDecoderWithLM(lmPath, triePath, LM_ALPHA, LM_BETA);
If I'm not mistaken, the LM and trie files are included in the pre-trained model archive:
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.5.1/deepspeech-0.5.1-models.tar.gz
Otherwise, you can also create your own language model, which would make sense if you only need the model to recognize specific words.
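Since the question's code is in Python, a rough sketch of the same step with the Python bindings could look like the following. This is an assumption based on the Node.js example above: in the 0.5.x Python bindings enableDecoderWithLM also takes the alphabet path as its first argument, while newer releases drop it, so check the client.py example shipped with your release.
LM_ALPHA = 0.75
LM_BETA = 1.85

lm_path = './models/lm.binary'
trie_path = './models/trie'

# enable the language model before calling ds.stt(...)
# (0.5.x bindings expect the alphabet path as the first argument)
ds.enableDecoderWithLM(alphabet, lm_path, trie_path, LM_ALPHA, LM_BETA)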

Natural Language Processing Model

I'm a beginner in NLP, working on a project to parse and understand the intentions behind input lines typed by a user in English.
Here is what I think I should do:
Create a set of sentences with POS tags and hand-labelled intentions for every sentence.
Create a model, say a decision tree, and train it on the above sentences.
Try the model on user input:
Do basic tokenizing and POS tagging on the user's input sentence and test it with the above model to determine the intention of the sentence.
It may all be completely wrong or silly, but I'm determined to learn how to do it. I don't want to use ready-made solutions, and the programming language is not a concern.
How would you guys approach this task? Which model would you choose, and why? What steps are normally taken to build NLP parsers?
Thanks
I would use NLTK.
There is an online book with a chapter on tagging and a chapter on parsing. They also provide models in Python.
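For instance, tokenizing and POS tagging a sentence with NLTK takes only a couple of lines (a minimal sketch; it assumes the punkt and averaged_perceptron_tagger resources have been downloaded):
import nltk

# one-time downloads: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize("Book me a flight to London tomorrow")
print(nltk.pos_tag(tokens))  # prints a list of (token, POS tag) pairs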
Here is a simple example based on NLTK and Naive Bayes:
import nltk
import random
from nltk.corpus import movie_reviews

# one document per review: (list of words, 'pos' or 'neg' label)
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# use the 3000 most frequent words as features
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(3000)]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

print(find_features(movie_reviews.words("neg/cv000_29416.txt")))

featuresets = [(find_features(rev), category) for (rev, category) in documents]
training_set = featuresets[:1900]
testing_set = featuresets[1900:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Algo Accuracy:", nltk.classify.accuracy(classifier, testing_set) * 100)
