How to use mozilla deepspeech to convert speech to text using it's pre-trained model? - speech-to-text

I want to convert speech to text using mozilla deepspeech. But the output is really bad.
I have downloaded mozilla's pre trained model and then what i have done is this:
BEAM_WIDTH = 500
LM_WEIGHT = 1.50
VALID_WORD_COUNT_WEIGHT = 2.10
N_FEATURES = 26
N_CONTEXT = 9
ds = Model(model, N_FEATURES, N_CONTEXT, alphabet, BEAM_WIDTH)
fs,audio = wav.read(path)
data = audio[:,0] ## changing to mono channel (using only one channel)
prediction = ds.stt(data,fs)
print(test)
print(prediction)
Now the output is nowhere near to my audio sample. What do i have to do to increase it's accuracy?

I assume it's because you are not including any LanguageModel.
The pre-trained model is basically just the acoustic model which will only transcribe the audio to similar sounding text that may not make sense.
If you combine the acoustic model with a language model (LM) you will likely get better results.
In your code example I can see the Parameter LM_WEIGHT but not any refenrence to the LM itself.
I'm unsure in which Language you want to integrate deepspeech but here is the example for node-js. This is the part where the LM is integrated
const LM_ALPHA = 0.75;
const LM_BETA = 1.85;
let lmPath = './models/lm.binary';
let triePath = './models/trie';
model.enableDecoderWithLM(lmPath, triePath, LM_ALPHA, LM_BETA);
If I'm not mistaken, the LM & Trie file is included in the pre-trained download ZIP
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.5.1/deepspeech-0.5.1-models.tar.gz
Otherwise you can also create your own Language Model which would make sense if you only need the Model to recognize specific words.

Related

How to generate sentence embedding using long-former model

I am using Hugging Face mrm8488/longformer-base-4096-finetuned-squadv2 pre-trained model
https://huggingface.co/mrm8488/longformer-base-4096-finetuned-squadv2.
I want to generate sentence level embedding. I have a data-frame which has a text column.
I am using this code:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(ckpt)
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this." # I will pas text-column here from my data-frame
#question = "What has Huggingface done ?"
encoding = tokenizer(question, text, return_tensors="pt")
# I don't want to use it for Question-Answer use-case. I just need the sentence embeddings
input_ids = encoding["input_ids"]
# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
attention_mask = encoding["attention_mask"]
How can I do modification in the above code to generate embedding for sentences. ?
I have the following examples:
Text
i've added notes to the claim and it's been escalated for final review
after submitting the request you'll receive an email confirming the open request.
hello my name is person and i'll be assisting you
this is sam and i'll be assisting you for date.
I'll return the amount as asap.
ill return it to you.
The Longformer uses a local attention mechanism and you need to pass a global attention mask to let one token attend to all tokens of your sequence.
import torch
from transformers import LongformerTokenizer, LongformerModel
ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = LongformerTokenizer.from_pretrained(ckpt)
model = LongformerModel.from_pretrained(ckpt)
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this." # I will pas text-column here from my data-frame
#question = "What has Huggingface done ?"
encoding = tokenizer(text, return_tensors="pt")
global_attention_mask = [1].extend([0]*encoding["input_ids"].shape[-1])
encoding["global_attention_mask"] = global_attention_mask
# I don't want to use it for Question-Answer use-case. I just need the sentence embeddings
# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
o = model(**encoding)
sentence_embedding = o.last_hidden_state[:,0]
You should keep in mind that mrm8488/longformer-base-4096-finetuned-squadv2 was not pre-trained to produce meaningful sentence embeddings and faces the same issues as the MLM pre-trained BERT's regarding sentence embeddings.

How can I add the decode_batch_predictions() method into the Keras Captcha OCR model?

The current Keras Captcha OCR model returns a CTC encoded output, which requires decoding after inference.
To decode this, one needs to run a decoding utility function after inference as a separate step.
preds = prediction_model.predict(batch_images)
pred_texts = decode_batch_predictions(preds)
The decoded utility function uses keras.backend.ctc_decode, which in turn uses either a greedy or beam search decoder.
# A utility function to decode the output of the network
def decode_batch_predictions(pred):
input_len = np.ones(pred.shape[0]) * pred.shape[1]
# Use greedy search. For complex tasks, you can use beam search
results = keras.backend.ctc_decode(pred, input_length=input_len, greedy=True)[0][0][
:, :max_length
]
# Iterate over the results and get back the text
output_text = []
for res in results:
res = tf.strings.reduce_join(num_to_char(res)).numpy().decode("utf-8")
output_text.append(res)
return output_text
I would like to train a Captcha OCR model using Keras that returns the CTC decoded as an output, without requiring an additional decoding step after inference.
How would I achieve this?
The most robust way to achieve this is by adding a method which is called as part of the model definition:
def CTCDecoder():
def decoder(y_pred):
input_shape = tf.keras.backend.shape(y_pred)
input_length = tf.ones(shape=input_shape[0]) * tf.keras.backend.cast(
input_shape[1], 'float32')
unpadded = tf.keras.backend.ctc_decode(y_pred, input_length)[0][0]
unpadded_shape = tf.keras.backend.shape(unpadded)
padded = tf.pad(unpadded,
paddings=[[0, 0], [0, input_shape[1] - unpadded_shape[1]]],
constant_values=-1)
return padded
return tf.keras.layers.Lambda(decoder, name='decode')
Then defining the model as follows:
prediction_model = keras.models.Model(inputs=inputs, outputs=CTCDecoder()(model.output))
Credit goes to tulasiram58827.
This implementation supports exporting to TFLite, but only float32. Quantized (int8) TFLite export is still throwing an error, and is an open ticket with TF team.
Your question can be interpreted in two ways. One is: I want a neural network that solves a problem where the CTC decoding step is already inside what the network learned. The other one is that you want to have a Model class that does this CTC decoding inside of it, without using an external, functional function.
I don't know the answer to the first question. And I cannot even tell if it's feasible or not. In any case, sounds like a difficult theoretical problem and if you don't have luck here, you might want to try posting it in datascience.stackexchange.com, which is a more theory-oriented community.
Now, if what you are trying to solve is the second, engineering version of the problem, that's something I can help you with. The solution for that problem is the following:
You need to subclass keras.models.Model with a class with the method you want. I went over the tutorial in the link you posted and came with the following class:
class ModifiedModel(keras.models.Model):
# A utility function to decode the output of the network
def decode_batch_predictions(self, pred):
input_len = np.ones(pred.shape[0]) * pred.shape[1]
# Use greedy search. For complex tasks, you can use beam search
results = keras.backend.ctc_decode(pred, input_length=input_len, greedy=True)[0][0][
:, :max_length
]
# Iterate over the results and get back the text
output_text = []
for res in results:
res = tf.strings.reduce_join(num_to_char(res)).numpy().decode("utf-8")
output_text.append(res)
return output_text
def predict_texts(self, batch_images):
preds = self.predict(batch_images)
return self.decode_batch_predictions(preds)
You can give it the name you want, it's just for illustration purposes.
With this class defined, you would replace the line
# Get the prediction model by extracting layers till the output layer
prediction_model = keras.models.Model(
model.get_layer(name="image").input, model.get_layer(name="dense2").output
)
with
prediction_model = ModifiedModel(
model.get_layer(name="image").input, model.get_layer(name="dense2").output
)
And then you can replace the lines
preds = prediction_model.predict(batch_images)
pred_texts = decode_batch_predictions(preds)
with
pred_texts = prediction_model.predict_texts(batch_images)

How do I make a paraphrase generation using BERT/ GPT-2

I am trying hard to understand how to make a paraphrase generation using BERT/GPT-2. I cannot understand how do I make it. Could you please provide me with any resources where I will be able to make a paraphrase generation model?
"The input would be a sentence and the output would be a paraphrase of the sentence"
Here is my recipe for training a paraphraser:
Instead of BERT (encoder only) or GPT (decoder only) use a seq2seq model with both encoder and decoder, such as T5, BART, or Pegasus. I suggest using the multilingual T5 model that was pretrained for 101 languages. If you want to load embeddings for your own language (instead of using all 101), you can follow this recipe.
Find a corpus of paraphrases for your language and domain. For English, ParaNMT, PAWS, and QQP are good candidates. A corpus called Tapaco, extracted from Tatoeba, is a paraphrasing corpus that covers 73 languages, so it is a good starting point if you cannot find a paraphrase corpus for your language.
Fine-tune your model on this corpus. The code can be something like this:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
# use here a backbone model of your choice, e.g. google/mt5-base
backbone_model = 'cointegrated/rut5-base-multitask'
model = T5ForConditionalGeneration.from_pretrained(backbone_model)
tokenizer = T5Tokenizer.from_pretrained(backbone_model)
model.cuda();
optimizer = torch.optim.Adam(params=[p for p in model.parameters() if p.requires_grad], lr=1e-5)
# todo: load the paraphrasing corpus and define the get_batch function
for i in range(100500):
xx, yy = get_batch(mult=mult)
x = tokenizer(xx, return_tensors='pt', padding=True).to(model.device)
y = tokenizer(yy, return_tensors='pt', padding=True).to(model.device)
# do not force the model to predict pad tokens
y.input_ids[y.input_ids==0] = -100
loss = model(
input_ids=x.input_ids,
attention_mask=x.attention_mask,
labels=y.input_ids,
decoder_attention_mask=y.attention_mask,
return_dict=True
).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
model.save_pretrained('my_paraphraser')
tokenizer.save_pretrained('my_paraphraser')
A more complete version of this code can be found in this notebook.
After the training, the model can be used in the following way:
from transformers import pipeline
pipe = pipeline(task='text2text-generation', model='my_paraphraser')
print(pipe('Here is your text'))
# [{'generated_text': 'Here is the paraphrase or your text.'}]
If you want your paraphrases to be more diverse, you can control the generation procress using arguments like
print(pipe(
'Here is your text',
encoder_no_repeat_ngram_size=3, # make output different from input
do_sample=True, # randomize
num_beams=5, # try more options
max_length=128, # longer texts
))
Enjoy!
you can use T5 paraphrasing for generating paraphrases

Extract object features from Detectron2

I am really new to object detection, so sorry if this question seems too obvious.
I have a trained FasterRCNN model on detectron2 that detects objects, and I am trying to extract the features of each detected object for the output of my model's prediction. The tutorials available seems to look for ROI and make new predictions, along with their boxes and features. I have the boxes from inference, I just need to extract the features of each box. I added the code I've been working with below. Thank you
# Preprocessing
images = predictor.model.preprocess_image(inputs) # don't forget to preprocess
# Run Backbone Res1-Res4
features = predictor.model.backbone(images.tensor) # set of cnn features
#get proposed boxes + rois + features + predictions
# Run RoI head for each proposal (RoI Pooling + Res5)
proposal_boxes = [x.proposal_boxes for x in proposals]
features = [features[f] for f in predictor.model.roi_heads.in_features]
proposal_rois = predictor.model.roi_heads.box_pooler(features, proposal_boxes)
box_features = predictor.model.roi_heads.box_head(proposal_rois)
predictions = predictor.model.roi_heads.box_predictor(box_features)#found here: https://detectron2.readthedocs.io/_modules/detectron2/modeling/roi_heads/roi_heads.html
pred_instances, pred_inds = predictor.model.roi_heads.box_predictor.inference(predictions, proposals)
pred_instances = predictor.model.roi_heads.forward_with_given_boxes(features, pred_instances)
# output boxes, masks, scores, etc
pred_instances = predictor.model._postprocess(pred_instances, inputs, images.image_sizes) # scale box to orig size
# features of the proposed boxes
feats = box_features[pred_inds]
proposal_boxes = proposals[0].proposal_boxes[pred_inds]
This question has already been well discussed in this issue: https://github.com/facebookresearch/detectron2/issues/5.
The documentation also explains how to achieve this.

Natural Language Processing Model

I'm a beginner in NLP and making a project to parse, and understand the intentions of input lines by a user in english.
Here is what I think I should do:
Create a text of sentences with POS tagging & marked intentions for every sentence by hand.
Create a model say: decision tree and train it on the above sentences.
Try the model on user input:
Do basic tokenizing and POS tagging on user input sentence and testing it on the above model for knowing the intention of this sentence.
It all may be completely wrong or silly but I'm determined to learn how to do it. I don't want to use ready-made solutions and the programming language is not a concern.
How would you guys do this task? Which model to choose and why? Normally to make NLP parsers, what steps are done.
Thanks
I would use NLTK.
There is an online book with a chapter on tagging, and a chapter on parsing. They also provide models in python.
Here is a simple example based on NLTK and Bayes
import nltk
import random
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)),category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)
]
random.shuffle(documents)
all_words = [w.lower() for w in movie_reviews.words()]
for w in movie_reviews.words():
all_words.append(w.lower())git b
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]
def find_features(document):
words = set(document)
features = {}
for w in word_features:
features[w] = (w in words)
return features
print((find_features(movie_reviews.words("neg/cv000_29416.txt"))))
featuresets = [(find_features(rev),category) for (rev,category) in documents ]
training_set =featuresets[:10]
testing_set = featuresets[1900:]
classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Algo Accuracy: ",(nltk.classify.accuracy(classifier,testing_set))* 100 )

Resources